OpenAI Says SWE-bench Verified Is Contaminated
OPEN_SOURCE ↗
REDDIT · 8h ago · BENCHMARK RESULT


OpenAI says SWE-bench Verified no longer cleanly measures frontier coding capability: many tasks have flawed tests, and frontier models appear to have seen benchmark material during training. The company recommends SWE-bench Pro instead and says it has stopped reporting SWE-bench Verified scores.

// ANALYSIS

This is less a product launch than a benchmark obituary: once the leaderboard turns into train-on-the-test plus brittle test-case theater, it stops being a reliable proxy for real coding ability.

  • OpenAI audited a subset of hard tasks and found a majority had test or prompt issues that could reject correct solutions.
  • The contamination claim matters more than the headline number: if models can reproduce gold patches or task specifics from training, the score measures exposure as much as skill.
  • For developers, the practical takeaway is to treat older SWE-bench scores as legacy context, not a clean ranking signal for current frontier models.
  • The move pushes attention toward SWE-bench Pro and other newer evals, which may be better but also makes public comparison harder.
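The first bullet's failure mode, a benchmark test rejecting a correct solution, is easy to picture with a toy sketch. This is not drawn from the audited tasks themselves; the functions and messages below are hypothetical, illustrating only the general pattern of a test that pins an exact output string versus one that checks behavior:

```python
# Hypothetical illustration of a brittle benchmark test: it pins an exact
# error-message string, so a functionally correct patch that raises the
# right exception with different wording is scored as a failure.

def gold_divide(a, b):
    """The 'gold' implementation whose exact wording the test encodes."""
    if b == 0:
        raise ValueError("division by zero is not allowed")
    return a / b

def patched_divide(a, b):
    """A functionally correct candidate patch with different error wording."""
    if b == 0:
        raise ValueError("cannot divide by zero")
    return a / b

def brittle_test(fn):
    """Rejects correct solutions by matching the exact message string."""
    try:
        fn(1, 0)
    except ValueError as e:
        return str(e) == "division by zero is not allowed"
    return False

def robust_test(fn):
    """Checks behavior instead: exception type plus the normal-path result."""
    try:
        fn(1, 0)
        return False  # should have raised
    except ValueError:
        return fn(6, 3) == 2.0

print(brittle_test(gold_divide))      # True: the gold patch passes
print(brittle_test(patched_divide))   # False: a correct patch is rejected
print(robust_test(patched_divide))    # True: behavior-based test accepts it
```

When a leaderboard aggregates many tests of the brittle kind, "resolved rate" starts to measure wording luck rather than whether the patch actually fixed the bug.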
// TAGS
openai · swe-bench-verified · swe-bench-pro · benchmark · research · ai-coding · llm

DISCOVERED

8h ago

2026-04-26

PUBLISHED

10h ago

2026-04-26

RELEVANCE

9/10

AUTHOR

rm-rf-rm