MathArena: GPT-5.4 saturates USAMO 2026
REDDIT · 14d ago · BENCHMARK RESULT

MathArena's USAMO 2026 writeup shows GPT-5.4 reaching 95.2% after human validation, versus Gemini-2.5-Pro's 25% top score on the 2025 baseline. The benchmark that once exposed near-total proof-writing failure now looks close to saturated at the frontier.

// ANALYSIS

This looks less like steady improvement and more like a benchmark phase change. The best closed models are brushing up against the ceiling on proof-heavy olympiad math, but that does not mean the broader reasoning problem is solved.

  • GPT-5.4 posted 95.2% after human validation; Gemini-3.1-Pro landed at 74.4% and Opus-4.6 at 47.0%.
  • The 2025 baseline was far harsher: Gemini-2.5-Pro peaked at 25%, while most other models stayed below 5%.
  • Failure modes shifted from blunt guessing and circular reasoning to subtler proof-structure mistakes and occasional token-budget exhaustion.
  • The LLM jury largely matched human review, and GPT-5.4 was the strongest judge, which makes the result feel like genuine progress rather than grading noise.
  • For builders, the takeaway is simple: old math benchmarks are saturating fast, so the next wave needs fresher, contamination-resistant proof tasks.
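The writeup's claim that the LLM jury "largely matched human review" implies some aggregation of per-judge verdicts and an agreement check against human grading. A minimal sketch of how majority-vote jury aggregation and human-agreement measurement could work; all function names, thresholds, and data below are hypothetical illustrations, not MathArena's actual protocol:

```python
def jury_verdict(judge_votes, threshold=0.5):
    """Aggregate per-judge pass/fail votes on one proof by majority vote.

    judge_votes: list of booleans, one per LLM judge (hypothetical setup).
    Returns True when at least `threshold` of judges accept the proof.
    """
    return sum(judge_votes) / len(judge_votes) >= threshold

def agreement_rate(jury_verdicts, human_verdicts):
    """Fraction of problems where the jury verdict matches human review."""
    matches = sum(j == h for j, h in zip(jury_verdicts, human_verdicts))
    return matches / len(jury_verdicts)

# Toy example: 6 proofs, 3 LLM judges each (illustrative data only).
judge_votes_per_proof = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [True, True, False],
    [False, False, False],
    [True, True, True],
]
human_review = [True, True, False, True, False, True]

jury = [jury_verdict(votes) for votes in judge_votes_per_proof]
print(agreement_rate(jury, human_review))  # → 1.0 on this toy data
```

In practice a grading pipeline would also weight judges by calibration and escalate split votes to humans, which is one plausible reading of "after human validation" in the score above.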
// TAGS
matharena · benchmark · llm · reasoning

DISCOVERED

2026-03-28 (14d ago)

PUBLISHED

2026-03-28 (14d ago)

RELEVANCE

8/10

AUTHOR

Wonderful_Buffalo_32