REDDIT · 18d ago · BENCHMARK RESULT

Gemini 3.1 Pro SWE-bench score questioned

Google’s Feb. 19, 2026 Gemini 3.1 Pro update put the model near the top of SWE-bench Verified, which is why it keeps surfacing in leaderboard chatter. The Reddit thread argues that still doesn’t map cleanly to real coding, where Claude Opus 4.6 and GPT-5.4 can feel more reliable for debugging and iterative fixes.

// ANALYSIS

This reads like benchmark-maxing, not proof that Gemini is the best coding partner. SWE-bench is useful, but it rewards a very specific kind of one-shot patching that can flatter models that feel less dependable in messy, multi-turn workflows.

  • Google’s official table is for `Gemini 3.1 Pro Thinking (High)` on a single-attempt harness, and the scores are tightly bunched: 80.6% for Gemini, 80.8% for Opus 4.6, 80.0% for GPT-5.2 (a quick significance check follows this list). [Google benchmark page](https://deepmind.google/models/gemini/pro/)
  • OpenAI says SWE-bench Verified is now contaminated and recommends SWE-bench Pro instead, which is a strong sign the old leaderboard is drifting away from real coding ability. [OpenAI blog](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
  • The Reddit complaint matches a real workflow gap: Gemini can look strong on first-pass patch generation, but developers care more about iterative debugging, surgical rewrites, and not regressing adjacent code.
  • For long-horizon IDE or agent work, your own repo evals matter more than any single public score; a minimal harness sketch follows below.
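
A quick sanity check on those numbers: SWE-bench Verified contains 500 instances, so a single-attempt score is just a binomial proportion, and proportions that close cannot be separated statistically. The sketch below is plain Python over the scores quoted above; it assumes the standard 500-instance Verified split and uses a simple Wald interval, which is rough but fine at these pass rates.

```python
import math

# Single-attempt SWE-bench Verified scores from the Google table above.
N = 500  # published size of the Verified split
scores = {
    "Gemini 3.1 Pro Thinking (High)": 0.806,
    "Claude Opus 4.6": 0.808,
    "GPT-5.2": 0.800,
}

for model, p in scores.items():
    se = math.sqrt(p * (1 - p) / N)        # std. error of a binomial proportion
    lo, hi = p - 1.96 * se, p + 1.96 * se  # 95% Wald confidence interval
    print(f"{model}: {p:.1%} ({round(p * N)}/{N} resolved), "
          f"95% CI [{lo:.1%}, {hi:.1%}]")
```

Each interval is roughly ±3.5 points wide, and the 0.2-point gap between Gemini and Opus is exactly one task out of 500, so the ordering at the top of the leaderboard is noise, which is consistent with the thread's skepticism.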
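
On the last point, a repo-specific eval does not need heavy tooling. The following is a minimal hypothetical sketch, not any vendor's harness: `generate_patch` is a stand-in for whatever model API or CLI you are testing, the repo path and `pytest` command are placeholders, and scoring is simply "patch applies and the full suite stays green".

```python
import subprocess
from pathlib import Path

REPO = Path("/path/to/your/repo")  # placeholder: your actual checkout
TEST_CMD = ["pytest", "-q"]        # placeholder: whatever your suite uses

def run(cmd, **kw):
    """Run a command inside the repo and capture its output."""
    return subprocess.run(cmd, cwd=REPO, capture_output=True, text=True, **kw)

def evaluate(task_prompts, generate_patch):
    """Fraction of tasks where the model's patch applies cleanly and the
    full test suite still passes. `generate_patch(prompt) -> str` is a
    hypothetical stub wrapping the model under test."""
    passed = 0
    for prompt in task_prompts:
        run(["git", "checkout", "--", "."])   # discard tracked-file changes
        run(["git", "clean", "-fd"])          # drop files a prior patch added
        patch = generate_patch(prompt)        # call the model under test
        if run(["git", "apply", "-"], input=patch).returncode != 0:
            continue                          # unappliable patch counts as a fail
        if run(TEST_CMD).returncode == 0:
            passed += 1
    run(["git", "checkout", "--", "."])       # leave the tree clean
    return passed / len(task_prompts)
```

Wrapping the inner block in a retry loop that feeds test failures back to the model is the cheapest way to probe the iterative-debugging behavior the thread says the public score misses.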
// TAGS
gemini-3-1-pro · benchmark · ai-coding · reasoning · agent · llm

DISCOVERED: 18d ago (2026-03-24)
PUBLISHED: 19d ago (2026-03-24)
RELEVANCE: 9/10
AUTHOR: Additional-Alps-8209