OPEN_SOURCE
REDDIT // 14d ago · BENCHMARK RESULT
MathArena: GPT-5.4 saturates USAMO 2026
MathArena's USAMO 2026 writeup shows GPT-5.4 reaching 95.2% after human validation, versus Gemini-2.5-Pro's 25% top score on the 2025 baseline. The benchmark that once exposed near-total proof-writing failure now looks close to saturated at the frontier.
// ANALYSIS
This looks less like steady improvement and more like a benchmark phase change. The best closed models are brushing up against the ceiling on proof-heavy olympiad math, but that does not mean the broader reasoning problem is solved.
- GPT-5.4 posted 95.2% after human validation; Gemini-3.1-Pro landed at 74.4% and Opus-4.6 at 47.0%.
- The 2025 baseline was far harsher: Gemini-2.5-Pro peaked at 25%, while most other models stayed below 5%.
- Failure modes shifted from blunt guessing and circular reasoning to subtler proof-structure mistakes and occasional token-budget exhaustion.
- The LLM jury largely matched human review, and GPT-5.4 was the strongest judge, which makes the result feel like genuine progress rather than grading noise (a minimal jury sketch follows this list).
- For builders, the takeaway is simple: old math benchmarks are saturating fast, so the next wave needs fresher, contamination-resistant proof tasks.
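The jury-plus-human-validation setup described above is straightforward to prototype. Below is a minimal, hypothetical Python sketch of median-vote proof grading with a human-escalation rule; the `Judge` type, `jury_score`, the 0-7 scale, and the 2-point disagreement threshold are illustrative assumptions, not MathArena's actual pipeline.

```python
"""Sketch of an LLM-jury proof grader with human escalation.

Hypothetical: real judges would be model API calls; here a Judge is
any callable mapping (problem, proof) to a 0-7 olympiad-style score.
"""
from statistics import median
from typing import Callable, Sequence

Judge = Callable[[str, str], int]  # (problem, proof) -> score in 0..7

def jury_score(problem: str, proof: str, judges: Sequence[Judge]) -> float:
    """Aggregate independent judge scores with the median, which is
    robust to a single overly harsh or lenient judge."""
    return median(j(problem, proof) for j in judges)

def needs_human_review(problem: str, proof: str,
                       judges: Sequence[Judge], spread: int = 2) -> bool:
    """Escalate to human validation when judges disagree by more than
    `spread` points, mirroring the post's human-validation step."""
    scores = [j(problem, proof) for j in judges]
    return max(scores) - min(scores) > spread

if __name__ == "__main__":
    # Stub judges standing in for real model calls.
    judges = [lambda p, s: 7, lambda p, s: 6, lambda p, s: 3]
    print(jury_score("USAMO P1", "candidate proof", judges))        # 6
    print(needs_human_review("USAMO P1", "candidate proof", judges))  # True: 7 - 3 > 2
```

The median keeps one outlier judge from swinging a problem's grade, while the disagreement check concentrates scarce human effort on exactly the proofs where automated grading is least trustworthy.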
// TAGS
matharena · benchmark · llm · reasoning
DISCOVERED
2026-03-28
PUBLISHED
2026-03-28
RELEVANCE
8/10
AUTHOR
Wonderful_Buffalo_32