OPEN_SOURCE
REDDIT // 8h ago · BENCHMARK RESULT
OpenAI Says SWE-bench Verified Is Contaminated
OpenAI says SWE-bench Verified no longer measures frontier coding capability cleanly because many tasks have flawed tests and frontier models appear to have seen benchmark material during training. The company recommends SWE-bench Pro instead and says it has stopped reporting SWE-bench Verified scores.
// ANALYSIS
This is less a product launch than a benchmark obituary: once the leaderboard turns into train-on-the-test plus brittle test-case theater, it stops being a reliable proxy for real coding ability.
- OpenAI audited a subset of hard tasks and found a majority had test or prompt issues that could reject correct solutions.
- The contamination claim matters more than the headline number: if models can reproduce gold patches or task specifics from training, the score measures exposure as much as skill.
- For developers, the practical takeaway is to treat older SWE-bench scores as legacy context, not a clean ranking signal for current frontier models.
- The move pushes attention toward SWE-bench Pro and other newer evals, which may be better but also makes public comparison harder.
// TAGS
openai · swe-bench-verified · swe-bench-pro · benchmark · research · ai-coding · llm
DISCOVERED
8h ago
2026-04-26
PUBLISHED
10h ago
2026-04-26
RELEVANCE
9/10
AUTHOR
rm-rf-rm