OPEN_SOURCE
REDDIT // 14d ago · BENCHMARK RESULT
MathArena: GPT-5.4 saturates USAMO 2026
MathArena's USAMO 2026 writeup shows GPT-5.4 reaching 95.2% after human validation, versus Gemini-2.5-Pro's 25% top score on the 2025 baseline. The benchmark that once exposed near-total proof-writing failure now looks close to saturated at the frontier.
// ANALYSIS
This looks less like steady improvement and more like a benchmark phase change. The best closed models are brushing up against the ceiling on proof-heavy olympiad math, but that does not mean the broader reasoning problem is solved.
- GPT-5.4 posted 95.2% after human validation; Gemini-3.1-Pro landed at 74.4% and Opus-4.6 at 47.0%.
- The 2025 baseline was far harsher: Gemini-2.5-Pro peaked at 25%, while most other models stayed below 5%.
- Failure modes shifted from blunt guessing and circular reasoning to subtler proof-structure mistakes and occasional token-budget exhaustion.
- The LLM jury largely matched human review, and GPT-5.4 was the strongest judge, which makes the result feel like genuine progress rather than grading noise (a minimal jury sketch follows this list).
- For builders, the takeaway is simple: old math benchmarks are saturating fast, so the next wave needs fresher, contamination-resistant proof tasks.
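The jury-plus-human-validation setup described above is straightforward to prototype. Below is a minimal, hypothetical Python sketch of median-vote proof grading with a human-escalation rule; the `Judge` type, `jury_score`, the 0-7 scale, and the 2-point disagreement threshold are illustrative assumptions, not MathArena's actual pipeline.

```python
"""Sketch of an LLM-jury proof grader with human escalation.

Hypothetical: real judges would be model API calls; here a Judge is
any callable mapping (problem, proof) to a 0-7 olympiad-style score.
"""
from statistics import median
from typing import Callable, Sequence

Judge = Callable[[str, str], int]  # (problem, proof) -> score in 0..7

def jury_score(problem: str, proof: str, judges: Sequence[Judge]) -> float:
    """Aggregate independent judge scores with the median, which is
    robust to a single overly harsh or lenient judge."""
    return median(j(problem, proof) for j in judges)

def needs_human_review(problem: str, proof: str,
                       judges: Sequence[Judge], spread: int = 2) -> bool:
    """Escalate to human validation when judges disagree by more than
    `spread` points, mirroring the post's human-validation step."""
    scores = [j(problem, proof) for j in judges]
    return max(scores) - min(scores) > spread

if __name__ == "__main__":
    # Stub judges standing in for real model calls.
    judges = [lambda p, s: 7, lambda p, s: 6, lambda p, s: 3]
    print(jury_score("USAMO P1", "candidate proof", judges))        # 6
    print(needs_human_review("USAMO P1", "candidate proof", judges))  # True: 7 - 3 > 2
```

The median keeps one outlier judge from swinging a problem's grade, while the disagreement check concentrates scarce human effort on exactly the proofs where automated grading is least trustworthy.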
// TAGS
matharena · benchmark · llm · reasoning
DISCOVERED
2026-03-28
PUBLISHED
2026-03-28
RELEVANCE
8/10
AUTHOR
Wonderful_Buffalo_32