Artificial Analysis Coding Agent Index adopts DeepSWE
Artificial Analysis has updated its Coding Agent Index by replacing SWE-Bench Pro with Datacurve's DeepSWE to measure the performance, speed, and cost of AI coding agent stacks. By using DeepSWE's 113 repository-wide tasks, the index aims to address the limitations of older, single-file benchmarks prone to overfitting.
Static coding benchmarks lose reliability as models overfit to their test sets, which makes multi-file, game-resistant evaluations like DeepSWE increasingly important. The replacement of SWE-Bench Pro points to a broader industry shift away from contaminated or easily gamed benchmarks and towards repository-wide exploration and behavioral verification that test long-horizon capabilities. Reporting task success rates alongside operational metrics such as speed, token usage, and cost gives a fuller picture of an agent's practical value than a pass rate alone.
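To make the idea of pairing success rates with operational metrics concrete, here is a minimal Python sketch. It is not Artificial Analysis's or Datacurve's methodology: the `TaskRun` record, its field names, and the `summarize` helper are hypothetical, chosen only to illustrate how per-task results might be rolled up into a few comparable numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRun:
    """One agent attempt at one benchmark task (hypothetical schema)."""
    task_id: str
    resolved: bool        # did the attempt pass the task's verification?
    wall_seconds: float   # end-to-end time for the attempt
    tokens: int           # total tokens consumed
    cost_usd: float       # total spend for the attempt


def summarize(runs: list[TaskRun]) -> dict[str, Optional[float]]:
    """Roll per-task runs up into success and operational metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    solved = sum(r.resolved for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "success_rate": solved / n,
        "mean_seconds": sum(r.wall_seconds for r in runs) / n,
        "mean_tokens": sum(r.tokens for r in runs) / n,
        # Failed attempts still cost money, so total spend is divided
        # by the number of solved tasks; None when nothing was solved.
        "cost_per_resolved_usd": total_cost / solved if solved else None,
    }
```

Dividing total spend by solved tasks, rather than by all tasks, is one possible design choice: it penalizes an agent stack that is cheap per attempt but rarely succeeds.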
DISCOVERED
2026-06-12
PUBLISHED
2026-06-12
AUTHOR
theo