OpenAI Drops SWE-bench Verified Benchmark
OpenAI published an audit arguing that SWE-bench Verified has been undermined by two problems: flawed or underspecified tests and training-data contamination. The company says it reviewed 138 hard problems that o3 failed inconsistently across 64 runs, found that 59.4% had material test or description issues, and concluded that score gains on the benchmark increasingly reflect exposure during training rather than real coding ability. OpenAI now recommends reporting SWE-bench Pro instead and is stopping use of SWE-bench Verified scores for frontier launches.
Hot take: this is less a “benchmark is broken” story than a reminder that public evals eventually become training fodder unless they’re aggressively protected and refreshed.
- OpenAI’s core claim is that SWE-bench Verified no longer tracks real-world software engineering capability at frontier levels because contamination can inflate scores.
- The audit scope matters: 138 stubborn cases, reviewed by multiple experienced engineers, is enough to make the critique credible even if not exhaustive.
- The headline number is damning: 59.4% of the audited failures had test or design problems significant enough to block a correct solution.
- OpenAI reads the benchmark’s slow recent improvement, from 74.9% to 80.9% over six months, as evidence that the metric is saturating or getting noisy rather than revealing genuine model progress (the arithmetic sketch after this list puts both figures in problem counts).
- OpenAI’s practical recommendation is to shift reporting to SWE-bench Pro and invest in privately authored, less-contaminated benchmarks.
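To put the audit numbers in perspective, here is a minimal back-of-envelope sketch in Python. It assumes SWE-bench Verified’s documented size of 500 problems; every other figure comes straight from the summary above, so treat it as an illustration rather than part of OpenAI’s audit.

```python
# Back-of-envelope check on the audit figures reported above.
# Assumption: SWE-bench Verified contains 500 problems (its documented size);
# all other numbers are taken directly from the summary.

AUDITED_FAILURES = 138    # hard problems o3 failed inconsistently across 64 runs
FLAWED_FRACTION = 0.594   # share judged to have material test/description issues
BENCHMARK_SIZE = 500      # assumed SWE-bench Verified task count

flawed_cases = round(AUDITED_FAILURES * FLAWED_FRACTION)
print(f"Flagged problems in the audit: ~{flawed_cases} of {AUDITED_FAILURES}")
# -> ~82 of 138

# The six-month score movement, expressed in problems rather than points.
old_score, new_score = 0.749, 0.809
gained_problems = round((new_score - old_score) * BENCHMARK_SIZE)
print(f"Score gain of {new_score - old_score:.1%} corresponds to ~{gained_problems} problems")
# -> ~30 problems
```

Under those assumptions, the audit flags roughly 82 problems as defective, while the entire six-month improvement amounts to about 30 problems, which is the rough intuition behind the “saturating or noisy” reading.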
Discovered: 2026-04-26 (6h ago)
Published: 2026-04-26 (8h ago)
Author: kmdupree