Artificial Analysis Coding Agent Index adopts DeepSWE

// 49d ago · PRODUCT UPDATE

Artificial Analysis has updated its Coding Agent Index by replacing SWE-Bench Pro with Datacurve's DeepSWE to measure the performance, speed, and cost of AI coding agent stacks. By using DeepSWE's 113 repository-wide tasks, the index aims to address the limitations of older, single-file benchmarks prone to overfitting.

// ANALYSIS

Static coding benchmarks are losing reliability as models overfit to their test sets, which makes multi-file, game-resistant evaluations like DeepSWE essential. Replacing SWE-Bench Pro signals an industry shift away from contaminated or easily gamed benchmarks and toward repository-wide exploration and behavioral verification that test long-horizon capabilities. Combining task success rates with operational metrics such as speed, token usage, and cost gives a fuller picture of an agent's practical business utility.

// TAGS

agent, coding-assistants, benchmarks, artificial-analysis, deepswe, software-engineering

DISCOVERED

49d ago

2026-06-12

PUBLISHED

49d ago

2026-06-12

RELEVANCE

8/10

AUTHOR

theo

// KEEP READING

More AI developer news from the feed

OPEN SOURCE · 14m ago

Twelve Apache 2.0 Models Land on Huawei Ascend

Twelve open-weight AI models covered by Apache 2.0 licenses were released on the Huawei Ascend ecosystem. While most of these models mirror existing architectures from Nvidia and Cohere rather than introducing novel designs, their arrival highlights the rapid speed at which China's domestic AI hardware platform is expanding software and model compatibility to build a self-sustaining developer ecosystem.

NEWS · 1h ago

OpenAI Withholds New Model Sparking Safety Debates

A recent social media update points out that a new model from OpenAI is reportedly not planned for general release, drawing parallels to earlier incidents involving restricted model deployments. The post questions OpenAI's strategy and safety considerations as public interest surrounding undisclosed or gated models continues to grow.

MODEL · 2h ago

Claude Opus 5 Token Inflation Slows Task Completion

Although Claude Opus 5 generates 57 tokens per second, faster on paper than Fable 5, users report that it feels painfully slow for routine tasks. The cause is token inflation rather than generation latency: the model emits far more intermediate tokens and detailed steps, particularly under high-effort configurations, so end-to-end task completion takes longer.
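
The arithmetic behind that claim can be sketched roughly as follows. Only the 57 tokens per second figure comes from the item above; the token counts, the slower model's speed, and the helper name `completion_seconds` are hypothetical values chosen for illustration, and the sketch ignores network latency and time to first token.

```python
def completion_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time as tokens emitted divided by generation speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# 57 tokens/s is the speed quoted above; all other numbers are made up.
fast_but_verbose = completion_seconds(output_tokens=5700, tokens_per_second=57.0)
slow_but_terse = completion_seconds(output_tokens=1200, tokens_per_second=40.0)

print(f"fast but verbose: {fast_but_verbose:.0f} s")
print(f"slow but terse:   {slow_but_terse:.0f} s")
```

Under these assumed numbers, the faster model still takes more than three times as long to finish, because it emits almost five times as many tokens for the same task.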