How We Evaluate

AgentHunter Eval is an open-source framework that automatically benchmarks MCP servers. Every score is reproducible — the code, methodology, and raw data are publicly available.

Pipeline

  1. Connect — spawn the MCP server via stdio and discover all available tools
  2. Generate tasks — Claude reads each tool's JSON Schema and creates test cases (basic, edge-case, adversarial)
  3. Execute — run every task multiple times to measure reliability
  4. Score — LLM-as-judge (Claude Sonnet 4) evaluates output quality against expected behavior
  5. Report — aggregate into five dimension scores and an overall score
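The five stages above can be sketched as a single pipeline function. This is an illustrative outline, not the actual AgentHunter Eval API: the names (`Task`, `generate_tasks`, `run_pipeline`) are hypothetical, and the execute/score stages are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Task:
    tool: str
    kind: str    # "basic", "edge-case", or "adversarial"
    prompt: str

def generate_tasks(tool_schemas: dict) -> list[Task]:
    # In the real pipeline Claude reads each tool's JSON Schema and
    # produces varied test cases; here we stub one basic task per tool.
    return [Task(tool=name, kind="basic", prompt=f"call {name}")
            for name in tool_schemas]

def run_pipeline(tool_schemas: dict, runs_per_task: int = 3) -> dict:
    # 1. Connect: tool discovery is assumed done; schemas are passed in.
    tasks = generate_tasks(tool_schemas)           # 2. Generate
    results = {t.tool: [True] * runs_per_task      # 3. Execute (stubbed:
               for t in tasks}                     #    every run "passes")
    # 4-5. Score and report: success rate per tool stands in for the
    # LLM-as-judge scoring described above.
    return {tool: sum(runs) / len(runs) for tool, runs in results.items()}
```

In the real framework, stage 3 would spawn the MCP server over stdio and replay each task `runs_per_task` times to measure reliability.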

Scoring Dimensions

| Dimension | Weight | What we measure |
| --- | --- | --- |
| Capability | 30% | Task completion rate + output quality (LLM-as-judge). Harder tasks weighted more. |
| Reliability | 25% | Success rate across multiple runs of the same task. |
| Efficiency | 20% | Response latency. Sub-500ms = 100, over 10s = 0. |
| Safety | 15% | Prompt injection resistance, scope violations, data leakage. |
| Dev Experience | 10% | Schema quality (typed properties, descriptions), documentation, error message helpfulness. |

Overall Score

The overall score is a weighted average of all five dimensions, scaled to 0-100. Task generation uses deterministic caching — once generated, the same tasks are reused for fair comparison across runs.
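The weighted average follows directly from the table above; only the dictionary keys here are illustrative:

```python
# Dimension weights from the scoring table (they sum to 1.0).
WEIGHTS = {
    "capability": 0.30,
    "reliability": 0.25,
    "efficiency": 0.20,
    "safety": 0.15,
    "dev_experience": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of the five dimension scores (each on 0-100),
    which keeps the overall score on the same 0-100 scale."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```

For example, a server scoring 80/90/70/100/60 across the five dimensions would land at 81.5 overall.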

Transparency

Limitations