OPEN_SOURCE
REDDIT · 18d ago · BENCHMARK RESULT
Gemini 3.1 Pro SWE-bench score questioned
Google’s Feb. 19, 2026 Gemini 3.1 Pro update put the model near the top of SWE-bench Verified, which is why it keeps surfacing in leaderboard chatter. The Reddit thread argues that still doesn’t map cleanly to real coding, where Claude Opus 4.6 and GPT-5.4 can feel more reliable for debugging and iterative fixes.
// ANALYSIS
This reads like benchmark-maxing, not proof that Gemini is the best coding partner. SWE-bench is useful, but it rewards a very specific kind of one-shot patching that can flatter models that feel less dependable in messy, multi-turn workflows.
- Google’s official table is for `Gemini 3.1 Pro Thinking (High)` on a single-attempt harness, and the scores are tightly bunched: 80.6% for Gemini, 80.8% for Opus 4.6, 80.0% for GPT-5.2. [Google benchmark page](https://deepmind.google/models/gemini/pro/)
- OpenAI says SWE-bench Verified is now contaminated and recommends SWE-bench Pro instead, which is a strong sign the old leaderboard is drifting away from real coding ability. [OpenAI blog](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
- The Reddit complaint matches a real workflow gap: Gemini can look strong on first-pass patch generation, but developers care more about iterative debugging, surgical rewrites, and not regressing adjacent code.
- For long-horizon IDE or agent work, your own repo evals matter more than any single public score.
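The last point can be made concrete with a minimal, model-agnostic eval loop over your own repo tasks. This is a sketch, not any vendor's harness: `generate_patch` and `apply_and_test` are hypothetical callbacks you would wire to your model API and your test runner.

```python
def run_eval(tasks, generate_patch, apply_and_test):
    """Score a model on private repo tasks.

    For each task, ask the model for a patch (generate_patch is a
    hypothetical hook for your model API), then apply it and run the
    repo's test suite (apply_and_test is your harness hook). Returns
    the pass rate and per-task results.
    """
    results = []
    for task in tasks:
        patch = generate_patch(task)
        passed = apply_and_test(task, patch)
        results.append({"id": task["id"], "passed": passed})
    score = sum(r["passed"] for r in results) / max(len(results), 1)
    return score, results


# Stub demo: pretend the model only solves tasks marked "easy".
tasks = [
    {"id": "fix-1", "prompt": "easy off-by-one in pagination"},
    {"id": "fix-2", "prompt": "hard race condition in worker pool"},
]
score, results = run_eval(
    tasks,
    generate_patch=lambda t: "diff --git (stub patch)",
    apply_and_test=lambda t, p: "easy" in t["prompt"],
)
print(score)  # 0.5
```

Even a tiny loop like this, run over a dozen real issues from your own repo, tells you more about multi-turn reliability than a tightly bunched public leaderboard.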
// TAGS
gemini-3-1-pro · benchmark · ai-coding · reasoning · agent · llm
DISCOVERED
18d ago (2026-03-24)
PUBLISHED
19d ago (2026-03-24)
RELEVANCE
9/10
AUTHOR
Additional-Alps-8209