Commit 539fd09

e2e testing md
1 parent 7e4b835 commit 539fd09

1 file changed

+62
-0
lines changed

@@ -0,0 +1,62 @@
# E2E testing framework for MCP servers

## Objective

Companies such as Asana, PayPal, and Sentry host MCP servers in production. These companies need to know that their servers are up and running, and that the servers work for their customers' workflows.

The purpose of end-to-end (E2E) testing for MCP servers is to simulate customers' workflows and ensure they return the right results. The high-level logic of an E2E test is as follows:

1. The developer defines an E2E test:

```
{
  "servers": {
    "asana": {
      "command": "npx",
      "args": ["mcp-remote", "https://mcp.asana.com/sse"]
    }
  },
  "test_cases": [
    {
      "query": "What Asana workspace am I in?",
      "expected": "The workspace 'MCPJam' is returned"
    },
    {
      "query": "Create a task called 'Build E2E test'",
      "expected": "<important>Task must be in the MCPJam workspace</important>. The task 'Build E2E test' is created"
    }
  ]
}
```
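
A definition like this could be checked before any agent run, so that a malformed file fails early. The following is a minimal sketch, assuming the definition is stored as JSON with `servers` as an object keyed by server name; `E2ECase` and `parse_definition` are hypothetical names, not part of MCPJam.

```python
import json
from dataclasses import dataclass


@dataclass
class E2ECase:
    """One query plus the natural-language expectation the judge checks."""
    query: str
    expected: str


def parse_definition(raw: str) -> tuple[dict, list[E2ECase]]:
    """Parse a test definition and fail early on missing fields."""
    data = json.loads(raw)

    # Assumption: servers are an object keyed by server name.
    servers = data.get("servers")
    if not isinstance(servers, dict) or not servers:
        raise ValueError("'servers' must be a non-empty object keyed by server name")

    cases = []
    for i, case in enumerate(data.get("test_cases", [])):
        missing = {"query", "expected"} - case.keys()
        if missing:
            raise ValueError(f"test case {i} is missing {sorted(missing)}")
        cases.append(E2ECase(case["query"], case["expected"]))

    if not cases:
        raise ValueError("'test_cases' must contain at least one case")
    return servers, cases
```

Calling `parse_definition` on the example above would return the `asana` server entry and two `E2ECase` objects.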

2. Tests are run through an agent. The agent connects to the MCP servers, runs the test cases (preferably in parallel), and outputs a trace.
3. The trace is passed to an LLM acting as a judge. The judge agent examines the trace to determine the performance and score of the E2E test.

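
Steps 2 and 3 can be sketched as a small concurrent pipeline. This is only an illustration: `run_agent` and `judge` are hypothetical stubs standing in for the real agent and the LLM judge, and the stub judge merely checks that the trace contains a tool call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    query: str
    passed: bool
    reason: str


async def run_agent(query: str) -> list[str]:
    """Hypothetical stand-in for the agent; returns a trace of steps."""
    await asyncio.sleep(0)  # placeholder for real MCP tool calls
    return [f"user: {query}", "tool_call: (stub)", "assistant: (stub answer)"]


async def judge(query: str, expected: str, trace: list[str]) -> Verdict:
    """Hypothetical stand-in for the LLM judge.

    A real judge would compare the trace against `expected`;
    this stub only checks that some tool was called.
    """
    used_tool = any(step.startswith("tool_call:") for step in trace)
    reason = "tool call present" if used_tool else "no tool call in trace"
    return Verdict(query, used_tool, reason)


async def run_case(case: dict) -> Verdict:
    trace = await run_agent(case["query"])
    return await judge(case["query"], case["expected"], trace)


async def run_suite(test_cases: list[dict]) -> list[Verdict]:
    # gather() runs the cases concurrently and preserves input order.
    return await asyncio.gather(*(run_case(c) for c in test_cases))
```

A suite would then be run with `asyncio.run(run_suite(test_cases))`, yielding one `Verdict` per test case.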
### Prompt discovery test

The purpose of the prompt discovery test is to find out which prompts break. An agent looks at the tools of the MCP server and generates new queries, and E2E tests are run on those queries. If a test fails, we know that workflow is broken.

Prompt discovery tests are useful for discovering new workflows to test and confirming that they work. In effect, this test is an edge-case finder.

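
The discovery loop might look roughly like the sketch below. It assumes each tool is a dict with the `name` and `description` fields that MCP tool listings carry; `generate_queries` is a hypothetical template-based stand-in for the query-generating agent, and `run_e2e` is any callable that returns whether a query passed.

```python
from typing import Callable


def generate_queries(tools: list[dict]) -> list[str]:
    """Hypothetical stand-in for the query-generating agent.

    A real implementation would prompt an LLM with the tool list;
    this stub fills two fixed templates per tool.
    """
    templates = [
        "Please do the following: {description}",
        "Using {name}, do the following: {description}",
    ]
    queries = []
    for tool in tools:
        for template in templates:
            queries.append(
                template.format(name=tool["name"], description=tool["description"])
            )
    return queries


def find_broken(queries: list[str], run_e2e: Callable[[str], bool]) -> list[str]:
    """Return the generated queries whose E2E run did not pass."""
    return [q for q in queries if not run_e2e(q)]
```

Every query returned by `find_broken` points at a workflow worth investigating.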
### Benchmark test

We want MCPJam customers to create a benchmark E2E test. A customer would create a test definition (like the example in step 1) with the most popular user queries. We would run these tests periodically to catch regressions in the server.

For example, the benchmark might be that 70% of the tests pass. If that drops to 30%, we know there has been a regression.

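
The regression check reduces to comparing pass rates between runs. A minimal sketch, assuming each run yields a list of pass/fail booleans; the `tolerance` parameter is an assumption added here to absorb run-to-run noise and is not part of the spec.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of test cases that passed in one run."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def is_regression(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a run whose pass rate fell more than `tolerance` below the baseline."""
    return baseline - current > tolerance
```

With a baseline of 0.7, a run scoring 0.3 would be flagged, while a run scoring 0.68 would not.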
## Product spec requirements

### Benchmark test is in MCPJam

- New tab in the MCPJam inspector called "Benchmark E2E tests".
- User defines an E2E test in the UI. User can create an E2E test for any connected server in MCPJam.
- User can run the E2E test. Results and score are shown.
- Display thinking and agent tracing in the UI.
- We'll have the base open source version, where you can run a benchmark on any server. We'll have paid cloud features where you can save your runs and see them over time.

### Prompt discovery test

- The requirement is that it can generate new prompts and run the E2E tests on each new prompt.
- The prompt discovery test will not be in MCPJam open source.
- We'll build this privately and offer prompt discovery E2E as a service for enterprise.
- We'll manually test their MCP servers this way ourselves.
