OPEN_SOURCE
REDDIT // BENCHMARK RESULT · 37d ago
GPT-5.4 sets new FrontierMath record
Epoch AI says GPT-5.4 Pro set a new FrontierMath record, scoring 50% on Tiers 1–3 and 38% on Tier 4, with one previously unsolved Tier 4 problem cracked in evaluation. The result matters because FrontierMath is one of the hardest public math-reasoning benchmarks, though Epoch also notes held-out vs non-held-out differences were not statistically significant.
// ANALYSIS
This is the kind of benchmark jump that moves "frontier reasoning" from hype toward measurable capability, but it also shows how fragile top-line scores can be when hard evals have small sample sizes and possible shortcut paths.
- FrontierMath is unusually high-signal because the problems are original, expert-written, and far harder than mainstream math leaderboards
- GPT-5.4 Pro solving a never-before-solved Tier 4 problem is the standout detail, even more than the headline percentage
- Epoch disclosed that OpenAI funded FrontierMath and has exclusive access to many problems and solutions, so the held-out analysis is important context rather than footnote material
- The 38% Tier 4 pass@10 result suggests the model gets meaningfully stronger with repeated attempts, which matters for agent-style workflows that can retry or branch
- One newly solved problem appears to have been shortcut via a 2011 preprint, a reminder that benchmark wins still need careful interpretation before being treated as pure reasoning breakthroughs
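For context on the pass@10 figure above: pass@k is commonly computed with the unbiased estimator popularized by the HumanEval paper, which estimates the chance that at least one of k samples (drawn from n recorded attempts, c of them correct) solves the problem. A minimal sketch; the numbers below are illustrative, not Epoch's actual per-problem data:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c successes, is correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative problem: 10 attempts, 3 correct.
print(round(pass_at_k(10, 3, 1), 2))   # pass@1  -> 0.3
print(round(pass_at_k(10, 3, 10), 2))  # pass@10 -> 1.0
```

The gap between pass@1 and pass@10 in examples like this is why retry-capable agent loops can look much stronger than single-shot scores suggest.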
// TAGS
gpt-5-4 · llm · reasoning · benchmark · research
DISCOVERED
37d ago
2026-03-06
PUBLISHED
37d ago
2026-03-05
RELEVANCE
10 / 10
AUTHOR
likeastar20