OpenAI Drops SWE-bench Verified Benchmark
OpenAI published an audit arguing that SWE-bench Verified has been undermined by two problems: flawed or underspecified tests and training-data contamination. The company says it reviewed 138 hard problems that o3 failed inconsistently across 64 runs, found that 59.4% had material test or description issues, and concluded that score gains on the benchmark increasingly reflect exposure during training rather than real coding ability. OpenAI now recommends reporting SWE-bench Pro instead and is stopping use of SWE-bench Verified scores for frontier launches.
Hot take: this is less a “benchmark is broken” story than a reminder that public evals eventually become training fodder unless they’re aggressively protected and refreshed.
- OpenAI’s core claim is that SWE-bench Verified no longer tracks real-world software engineering capability at frontier levels because contamination can inflate scores.
- The audit scope matters: 138 stubborn cases, reviewed by multiple experienced engineers, is enough to make the critique credible even if not exhaustive.
- The headline number is damning: 59.4% of the audited failures had test or design problems significant enough to block a correct solution.
- OpenAI reads the benchmark’s slow recent improvement, from 74.9% to 80.9% over six months, as evidence that the metric is saturating or getting noisy rather than revealing genuine model progress (the arithmetic sketch after this list puts both figures in problem counts).
- OpenAI’s practical recommendation is to shift reporting to SWE-bench Pro and invest in privately authored, less-contaminated benchmarks.
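To put the audit numbers in perspective, here is a minimal back-of-envelope sketch in Python. It assumes SWE-bench Verified’s documented size of 500 problems; every other figure comes straight from the summary above, so treat it as an illustration rather than part of OpenAI’s audit.

```python
# Back-of-envelope check on the audit figures reported above.
# Assumption: SWE-bench Verified contains 500 problems (its documented size);
# all other numbers are taken directly from the summary.

AUDITED_FAILURES = 138    # hard problems o3 failed inconsistently across 64 runs
FLAWED_FRACTION = 0.594   # share judged to have material test/description issues
BENCHMARK_SIZE = 500      # assumed SWE-bench Verified task count

flawed_cases = round(AUDITED_FAILURES * FLAWED_FRACTION)
print(f"Flagged problems in the audit: ~{flawed_cases} of {AUDITED_FAILURES}")
# -> ~82 of 138

# The six-month score movement, expressed in problems rather than points.
old_score, new_score = 0.749, 0.809
gained_problems = round((new_score - old_score) * BENCHMARK_SIZE)
print(f"Score gain of {new_score - old_score:.1%} corresponds to ~{gained_problems} problems")
# -> ~30 problems
```

Under those assumptions, the audit flags roughly 82 problems as defective, while the entire six-month improvement amounts to about 30 problems, which is the rough intuition behind the “saturating or noisy” reading.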
Discovered: 2026-04-26 (6h ago)
Published: 2026-04-26 (8h ago)
Author: kmdupree