YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

OpenAI Says SWE-bench Verified Is Contaminated

// 45d ago | BENCHMARK RESULT

OpenAI says SWE-bench Verified no longer measures frontier coding capability cleanly because many tasks have flawed tests and frontier models appear to have seen benchmark material during training. The company recommends SWE-bench Pro instead and says it has stopped reporting SWE-bench Verified scores.

// ANALYSIS

This is less a product launch than a benchmark obituary: once the leaderboard turns into train-on-the-test plus brittle test-case theater, it stops being a reliable proxy for real coding ability.

  • OpenAI audited a subset of hard tasks and found that a majority had flawed tests or task descriptions that could cause correct solutions to be rejected.
  • The contamination claim matters more than the headline number: if models can reproduce gold patches or task specifics from training, the score measures exposure as much as skill.
  • For developers, the practical takeaway is to treat older SWE-bench scores as legacy context, not a clean ranking signal for current frontier models.
  • The move pushes attention toward SWE-bench Pro and other newer evals, which may be better, but it also makes public comparison harder.
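
The contamination point above can be made concrete with a rough overlap check. The sketch below is not OpenAI's audit method; it is a minimal illustration, assuming whitespace tokenization and 5-token windows, of how one might measure how much of a gold patch a model output reproduces verbatim. The function names and both sample patches are hypothetical, invented for this example.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token windows in tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated: str, gold: str, n: int = 5) -> float:
    """Fraction of the gold patch's n-grams that also appear in the generated patch."""
    gold_grams = ngrams(gold.split(), n)
    if not gold_grams:
        # Gold patch is shorter than n tokens, so there is nothing to compare.
        return 0.0
    gen_grams = ngrams(generated.split(), n)
    return len(gold_grams & gen_grams) / len(gold_grams)


# Hypothetical patches, invented for illustration only.
gold_patch = "if value is None : return default_value else : return int ( value )"
copied = gold_patch
independent = "try : return int ( value ) except TypeError : return default_value"

print(overlap_ratio(copied, gold_patch))       # verbatim copy of the gold patch
print(overlap_ratio(independent, gold_patch))  # different solution, some shared tokens
```

A ratio near 1.0 across many tasks is a hint of exposure rather than proof, since short or idiomatic fixes can overlap by chance. The window size and any flagging threshold are judgment calls, not values taken from the source.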
// TAGS
openai, swe-bench-verified, swe-bench-pro, benchmark, research, ai-coding, llm

DISCOVERED

45d ago

2026-04-26

PUBLISHED

45d ago

2026-04-26

RELEVANCE

9/10

AUTHOR

rm-rf-rm