How We Evaluate
AgentHunter Eval is an open-source framework that automatically benchmarks MCP servers. Every score is reproducible — the code, methodology, and raw data are publicly available.
Pipeline
- Connect — spawn the MCP server via stdio and discover all available tools
- Generate tasks — Claude reads each tool's JSON Schema and creates test cases (basic, edge-case, adversarial)
- Execute — run every task multiple times to measure reliability
- Score — LLM-as-judge (Claude Sonnet 4) evaluates output quality against expected behavior
- Report — aggregate into 5 dimension scores and an overall score
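The five stages above can be sketched as a single driver loop. This is a hypothetical skeleton, not the framework's real API: every function name, type, and stub implementation below is illustrative, and the real stages talk to a live MCP server and an LLM rather than returning canned data.

```typescript
// Illustrative pipeline skeleton. All names and stubs are assumptions,
// standing in for the real Connect / Generate / Execute stages.
type Task = { id: string; kind: "basic" | "edge-case" | "adversarial" };
type RunResult = { taskId: string; ok: boolean; latencyMs: number };

// Stubs standing in for the real stages:
const connect = async (_cmd: string) => ["search", "fetch"]; // discovered tool names
const generateTasks = async (tools: string[]): Promise<Task[]> =>
  tools.map((t, i) => ({ id: `${t}-${i}`, kind: "basic" }));
const execute = async (t: Task): Promise<RunResult> =>
  ({ taskId: t.id, ok: true, latencyMs: 120 });

async function evaluate(serverCmd: string, runsPerTask = 3): Promise<RunResult[]> {
  const tools = await connect(serverCmd);   // 1. Connect: spawn server, discover tools
  const tasks = await generateTasks(tools); // 2. Generate: test cases from each schema
  const results: RunResult[] = [];
  for (const task of tasks) {
    for (let i = 0; i < runsPerTask; i++) { // 3. Execute: repeat runs for reliability
      results.push(await execute(task));
    }
  }
  return results; // 4-5. Score (LLM-as-judge) and Report would consume these results
}
```

Running each task multiple times is what feeds the Reliability dimension: the same task list produces `tasks.length × runsPerTask` results to aggregate.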
Scoring Dimensions
| Dimension | Weight | What we measure |
|---|---|---|
| Capability | 30% | Task completion rate + output quality (LLM-as-judge). Harder tasks weighted more. |
| Reliability | 25% | Success rate across multiple runs of the same task. |
| Efficiency | 20% | Response latency. Sub-500ms = 100, over 10s = 0. |
| Safety | 15% | Prompt injection resistance, scope violations, data leakage. |
| Dev Experience | 10% | Schema quality (typed properties, descriptions), documentation, error message helpfulness. |
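The Efficiency row pins down two anchors (at or under 500 ms scores 100, at or over 10 s scores 0) but not the curve between them. A minimal sketch, assuming linear interpolation between the anchors (the framework may use a different falloff):

```typescript
// Latency-to-score mapping consistent with the table's thresholds.
// The linear falloff between the two anchors is an assumption.
const FAST_MS = 500;    // at or below this: full marks
const SLOW_MS = 10_000; // at or above this: zero

function latencyScore(latencyMs: number): number {
  if (latencyMs <= FAST_MS) return 100;
  if (latencyMs >= SLOW_MS) return 0;
  return (100 * (SLOW_MS - latencyMs)) / (SLOW_MS - FAST_MS);
}
```

Under this curve, a response at the halfway point of the falloff (5,250 ms) scores 50.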
Overall Score
The overall score is a weighted average of all five dimensions, scaled to 0-100. Task generation uses deterministic caching — once generated, the same tasks are reused for fair comparison across runs.
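The weighted average is straightforward given the table's weights, which sum to 1.0, so five dimension scores on a 0-100 scale produce an overall score already on 0-100. The field names below are illustrative:

```typescript
// Weights taken from the scoring-dimensions table above (sum = 1.0).
const WEIGHTS = {
  capability: 0.3,
  reliability: 0.25,
  efficiency: 0.2,
  safety: 0.15,
  devExperience: 0.1,
} as const;

type DimensionScores = Record<keyof typeof WEIGHTS, number>;

function overallScore(scores: DimensionScores): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as (keyof typeof WEIGHTS)[]) {
    total += WEIGHTS[dim] * scores[dim]; // each dimension score is 0-100
  }
  return total;
}
```

For example, a server scoring 80 on Capability and 100 everywhere else comes out to 94 overall, since the 20-point Capability gap is weighted at 30%.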
Transparency
- All evaluation code is open source
- Raw evaluation data for every server is published in the results directory
- Anyone can reproduce results: `npx @agenthunter/eval run`
- The judge model is configurable — scores note which model was used
Limitations
- LLM-generated tasks may not cover all real-world use cases
- Using Claude as both task generator and judge introduces bias
- Tools requiring real-world context (databases with data, filesystems with files) may score lower on reliability than in actual usage