How We Evaluate
AgentHunter Eval is an open-source framework that automatically benchmarks MCP servers. Every score is reproducible — the code, methodology, and raw data are publicly available.
Pipeline
- Connect — spawn the MCP server via stdio and discover all available tools
- Generate tasks — Claude reads each tool's JSON Schema and creates test cases (basic, edge-case, adversarial)
- Execute — run every task multiple times to measure reliability
- Score — LLM-as-judge (Claude Sonnet 4) evaluates output quality against expected behavior
- Report — aggregate into 5 dimension scores and an overall score
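The five stages above can be sketched as a single driver loop. This is a hypothetical skeleton, not the framework's real API: every function name, type, and stub implementation below is illustrative, and the real stages talk to a live MCP server and an LLM rather than returning canned data.

```typescript
// Illustrative pipeline skeleton. All names and stubs are assumptions,
// standing in for the real Connect / Generate / Execute stages.
type Task = { id: string; kind: "basic" | "edge-case" | "adversarial" };
type RunResult = { taskId: string; ok: boolean; latencyMs: number };

// Stubs standing in for the real stages:
const connect = async (_cmd: string) => ["search", "fetch"]; // discovered tool names
const generateTasks = async (tools: string[]): Promise<Task[]> =>
  tools.map((t, i) => ({ id: `${t}-${i}`, kind: "basic" }));
const execute = async (t: Task): Promise<RunResult> =>
  ({ taskId: t.id, ok: true, latencyMs: 120 });

async function evaluate(serverCmd: string, runsPerTask = 3): Promise<RunResult[]> {
  const tools = await connect(serverCmd);   // 1. Connect: spawn server, discover tools
  const tasks = await generateTasks(tools); // 2. Generate: test cases from each schema
  const results: RunResult[] = [];
  for (const task of tasks) {
    for (let i = 0; i < runsPerTask; i++) { // 3. Execute: repeat runs for reliability
      results.push(await execute(task));
    }
  }
  return results; // 4-5. Score (LLM-as-judge) and Report would consume these results
}
```

Running each task multiple times is what feeds the Reliability dimension: the same task list produces `tasks.length × runsPerTask` results to aggregate.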
Scoring Dimensions
| Dimension | Weight | What we measure |
|---|---|---|
| Capability | 30% | Task completion rate + output quality (LLM-as-judge). Harder tasks weighted more. |
| Reliability | 25% | Success rate across multiple runs of the same task. |
| Efficiency | 20% | Response latency. Sub-500ms = 100, over 10s = 0. |
| Safety | 15% | Prompt injection resistance, scope violations, data leakage. |
| Dev Experience | 10% | Schema quality (typed properties, descriptions), documentation, error message helpfulness. |
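The Efficiency row pins down two anchors (at or under 500 ms scores 100, at or over 10 s scores 0) but not the curve between them. A minimal sketch, assuming linear interpolation between the anchors (the framework may use a different falloff):

```typescript
// Latency-to-score mapping consistent with the table's thresholds.
// The linear falloff between the two anchors is an assumption.
const FAST_MS = 500;    // at or below this: full marks
const SLOW_MS = 10_000; // at or above this: zero

function latencyScore(latencyMs: number): number {
  if (latencyMs <= FAST_MS) return 100;
  if (latencyMs >= SLOW_MS) return 0;
  return (100 * (SLOW_MS - latencyMs)) / (SLOW_MS - FAST_MS);
}
```

Under this curve, a response at the halfway point of the falloff (5,250 ms) scores 50.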
Overall Score
The overall score is a weighted average of all five dimensions, scaled to 0-100. Task generation uses deterministic caching — once generated, the same tasks are reused for fair comparison across runs.
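The weighted average is straightforward given the table's weights, which sum to 1.0, so five dimension scores on a 0-100 scale produce an overall score already on 0-100. The field names below are illustrative:

```typescript
// Weights taken from the scoring-dimensions table above (sum = 1.0).
const WEIGHTS = {
  capability: 0.3,
  reliability: 0.25,
  efficiency: 0.2,
  safety: 0.15,
  devExperience: 0.1,
} as const;

type DimensionScores = Record<keyof typeof WEIGHTS, number>;

function overallScore(scores: DimensionScores): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as (keyof typeof WEIGHTS)[]) {
    total += WEIGHTS[dim] * scores[dim]; // each dimension score is 0-100
  }
  return total;
}
```

For example, a server scoring 80 on Capability and 100 everywhere else comes out to 94 overall, since the 20-point Capability gap is weighted at 30%.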
Transparency
- All evaluation code is open source
- Raw evaluation data for every server is published in the results directory
- Anyone can reproduce results: `npx @agenthunter/eval run`
- The judge model is configurable — scores note which model was used
Limitations
- LLM-generated tasks may not cover all real-world use cases
- Using Claude as both task generator and judge introduces bias
- Tools requiring real-world context (databases with data, filesystems with files) may score lower on reliability than in actual usage