AI Agent Evaluation

Independent evaluation platform for AI agents and their tools. Task completion scoring for agents. Quality benchmarks for MCP servers.

Agent Task Rankings

Agents evaluated on 5 standardized coding tasks: CLI creation, bug fixing, data analysis, test writing, and code refactoring.

| Agent | Pass Rate | Avg Time |
|---|---|---|
| 🥇 Claude Opus 4.6 | 10/10 | 9.4s |
| 🥈 Claude Haiku 4.5 | 9/10 | 3.9s |
| Claude Sonnet 4.6 | 9/10 | 10.2s |
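A pass-rate and average-time evaluation like the one above can be sketched as a simple harness. This is a hypothetical illustration, not the AgentHunter Eval implementation; the `evaluate` function, its signature, and the `(prompt, check)` task shape are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    passed: bool     # did the agent's output satisfy the task's check?
    seconds: float   # wall-clock time for this task

def evaluate(
    agent: Callable[[str], str],
    tasks: list[tuple[str, Callable[[str], bool]]],
) -> tuple[str, str]:
    """Run an agent on each (prompt, check) pair; return pass rate and avg time.

    Hypothetical sketch: real harnesses would sandbox execution, retry on
    transient failures, and grade with task-specific test suites.
    """
    results = []
    for prompt, check in tasks:
        start = time.perf_counter()
        output = agent(prompt)
        results.append(TaskResult(check(output), time.perf_counter() - start))
    passed = sum(r.passed for r in results)
    avg = sum(r.seconds for r in results) / len(results)
    return f"{passed}/{len(results)}", f"{avg:.1f}s"
```

With a trivial echo agent and two string-matching checks, this returns a pass rate like `"2/2"` and a formatted average time.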

Tool Quality Rankings

12 MCP servers benchmarked across capability, reliability, efficiency, safety, and developer experience.

| # | Score | Server | Reliability | Success |
|---|---|---|---|---|
| 1 | 89 | context7 | 100% | 100% |
| 2 | 86 | mcp-fetch | 90% | 90% |
| 3 | 82 | mcp-memory | 93% | 93% |
| 4 | 82 | notion-mcp | 97% | 97% |
| 5 | 81 | mcp-datetime | 73% | 73% |
| 6 | 75 | mcp-everything | 74% | 74% |
| 7 | 71 | mcp-sequential-thinking | 100% | 100% |
| 8 | 68 | mcp-filesystem | 14% | 14% |
| 9 | 68 | playwright-mcp | 30% | 30% |
| 10 | 63 | mcp-sqlite | 10% | 10% |
| 11 | 55 | mcp-git | 4% | 4% |
| 12 | 47 | mcp-puppeteer | 0% | 0% |

Scored using AgentHunter Eval v0.3.0.
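A composite Score over the five benchmark dimensions is typically a weighted average. The weights below are illustrative assumptions only, not the actual AgentHunter Eval formula:

```python
# Illustrative weights (assumptions, not the real scoring formula) for the
# five dimensions named above: capability, reliability, efficiency, safety,
# and developer experience.
WEIGHTS = {
    "capability": 0.30,
    "reliability": 0.25,
    "efficiency": 0.15,
    "safety": 0.20,
    "developer_experience": 0.10,
}

def composite_score(dimensions: dict[str, float]) -> int:
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    assert set(dimensions) == set(WEIGHTS), "all five dimensions required"
    return round(sum(WEIGHTS[k] * v for k, v in dimensions.items()))
```

For example, a server scoring 100 everywhere except 80 on capability would land at 94 under these (assumed) weights.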