OPEN_SOURCE
REDDIT // 8h ago · BENCHMARK RESULT
OpenAI Says SWE-bench Verified Is Contaminated
OpenAI says SWE-bench Verified no longer measures frontier coding capability cleanly because many tasks have flawed tests and frontier models appear to have seen benchmark material during training. The company recommends SWE-bench Pro instead and says it has stopped reporting SWE-bench Verified scores.
// ANALYSIS
This is less a product launch than a benchmark obituary: once the leaderboard turns into train-on-the-test plus brittle test-case theater, it stops being a reliable proxy for real coding ability.
- OpenAI audited a subset of hard tasks and found a majority had test or prompt issues that could reject correct solutions.
- The contamination claim matters more than the headline number: if models can reproduce gold patches or task specifics from training, the score measures exposure as much as skill.
- For developers, the practical takeaway is to treat older SWE-bench scores as legacy context, not a clean ranking signal for current frontier models.
- The move pushes attention toward SWE-bench Pro and other newer evals, which may be better but also makes public comparison harder.
// TAGS
openai · swe-bench-verified · swe-bench-pro · benchmark · research · ai-coding · llm
DISCOVERED
8h ago
2026-04-26
PUBLISHED
10h ago
2026-04-26
RELEVANCE
9/10
AUTHOR
rm-rf-rm