AI Agent Evaluation

Independent evaluation platform for AI agents and their tools. Task completion scoring for agents. Quality benchmarks for MCP servers.

Agent Task Rankings

Agents evaluated on 5 standardized coding tasks: CLI creation, bug fixing, data analysis, test writing, and code refactoring.

| Agent | Pass Rate | Avg Time |
|---|---|---|
| 🥇 Claude Opus 4.6 | 10/10 | 9.4s |
| 🥈 Claude Haiku 4.5 | 9/10 | 3.9s |
| Claude Sonnet 4.6 | 9/10 | 10.2s |
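A pass-rate and average-time evaluation like the one above can be sketched as a simple harness. This is a hypothetical illustration, not the AgentHunter Eval implementation; the `evaluate` function, its signature, and the `(prompt, check)` task shape are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    passed: bool     # did the agent's output satisfy the task's check?
    seconds: float   # wall-clock time for this task

def evaluate(
    agent: Callable[[str], str],
    tasks: list[tuple[str, Callable[[str], bool]]],
) -> tuple[str, str]:
    """Run an agent on each (prompt, check) pair; return pass rate and avg time.

    Hypothetical sketch: real harnesses would sandbox execution, retry on
    transient failures, and grade with task-specific test suites.
    """
    results = []
    for prompt, check in tasks:
        start = time.perf_counter()
        output = agent(prompt)
        results.append(TaskResult(check(output), time.perf_counter() - start))
    passed = sum(r.passed for r in results)
    avg = sum(r.seconds for r in results) / len(results)
    return f"{passed}/{len(results)}", f"{avg:.1f}s"
```

With a trivial echo agent and two string-matching checks, this returns a pass rate like `"2/2"` and a formatted average time.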

Tool Quality Rankings

12 MCP servers benchmarked across capability, reliability, efficiency, safety, and developer experience.

| # | Score | Server | Reliability | Success |
|---|---|---|---|---|
| 1 | 89 | context7 | 100% | 100% |
| 2 | 86 | mcp-fetch | 90% | 90% |
| 3 | 82 | mcp-memory | 93% | 93% |
| 4 | 82 | notion-mcp | 97% | 97% |
| 5 | 81 | mcp-datetime | 73% | 73% |
| 6 | 75 | mcp-everything | 74% | 74% |
| 7 | 71 | mcp-sequential-thinking | 100% | 100% |
| 8 | 68 | mcp-filesystem | 14% | 14% |
| 9 | 68 | playwright-mcp | 30% | 30% |
| 10 | 63 | mcp-sqlite | 10% | 10% |
| 11 | 55 | mcp-git | 4% | 4% |
| 12 | 47 | mcp-puppeteer | 0% | 0% |

Scored using AgentHunter Eval v0.3.0.
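A composite Score over the five benchmark dimensions is typically a weighted average. The weights below are illustrative assumptions only, not the actual AgentHunter Eval formula:

```python
# Illustrative weights (assumptions, not the real scoring formula) for the
# five dimensions named above: capability, reliability, efficiency, safety,
# and developer experience.
WEIGHTS = {
    "capability": 0.30,
    "reliability": 0.25,
    "efficiency": 0.15,
    "safety": 0.20,
    "developer_experience": 0.10,
}

def composite_score(dimensions: dict[str, float]) -> int:
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    assert set(dimensions) == set(WEIGHTS), "all five dimensions required"
    return round(sum(WEIGHTS[k] * v for k, v in dimensions.items()))
```

For example, a server scoring 100 everywhere except 80 on capability would land at 94 under these (assumed) weights.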