Exgentic maps agent cost, performance frontier
Exgentic launches an open general-agent leaderboard and evaluation framework that compares five agent stacks across six benchmarks without environment-specific tuning. The first results show model choice drives most of the score spread, while per-task cost varies enough to materially change which stack makes sense in production.
Exgentic matters less as another leaderboard and more as an attempt to standardize how general agents get measured. The headline finding is blunt: backbone models dominate performance, but the price gap between “best” and “best value” is large enough to reshape deployment decisions.
- Its Unified Protocol is the key technical move, letting MCP, tool-calling, and code-execution agents run against the same benchmark setup instead of requiring bespoke integrations.
- Claude Opus 4.5 pairings top raw performance, while GPT 5.2 configurations lead cost-efficiency, making the leaderboard useful for teams balancing quality against budget.
- The benchmark mix spans SWE-Bench Verified, BrowseComp+, AppWorld, and Tau2Bench domains, so it probes broader adaptability than single-domain agent leaderboards.
- Publishing the framework, paper, and live leaderboard together gives researchers and builders a shared baseline for comparing Claude Code, OpenAI Solo, Smolagent, and ReAct-style stacks.
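The gap between "best" and "best value" can be made concrete with a cost-performance (Pareto) frontier: a configuration stays on the frontier only if no other configuration is both cheaper and higher-scoring. The sketch below is a minimal illustration, not Exgentic's own code. The `Config` type, the helper names, and every number in `SAMPLE` are hypothetical placeholders, not leaderboard results, and "best value" is assumed here to mean the highest score per dollar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One agent-stack + model pairing (hypothetical schema)."""
    name: str
    score: float  # benchmark score, higher is better
    cost: float   # USD per task, lower is better


def pareto_frontier(configs):
    """Return configs that no other config beats on both cost and score."""
    # Walk from cheapest to most expensive; at equal cost, higher score first.
    ordered = sorted(configs, key=lambda c: (c.cost, -c.score))
    frontier = []
    best_score = float("-inf")
    for c in ordered:
        # A pricier config earns its place only by scoring strictly higher.
        if c.score > best_score:
            frontier.append(c)
            best_score = c.score
    return frontier


def best_value(configs):
    """Assumed definition of 'best value': highest score per dollar."""
    return max(configs, key=lambda c: c.score / c.cost)


# Placeholder numbers for illustration only, NOT real leaderboard data.
SAMPLE = [
    Config("stack-A + model-X", 0.70, 4.00),
    Config("stack-B + model-X", 0.66, 5.00),
    Config("stack-A + model-Y", 0.60, 1.00),
    Config("stack-C + model-Y", 0.55, 1.50),
    Config("stack-C + model-Z", 0.40, 0.80),
]

if __name__ == "__main__":
    for c in pareto_frontier(SAMPLE):
        print(f"{c.name}: score={c.score:.2f}, cost=${c.cost:.2f}")
    print("best raw score:", max(SAMPLE, key=lambda c: c.score).name)
    print("best value:", best_value(SAMPLE).name)
```

With these placeholder values the top scorer and the best-value pick are different configurations, which is the pattern the leaderboard reports. A team would choose a point on the frontier according to its budget rather than defaulting to the top score.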
DISCOVERED
2026-03-06
PUBLISHED
2026-03-06
AUTHOR
Discover AI