GPT-5.4 pro tops CritPt at 30%
Artificial Analysis says GPT-5.4 pro (xhigh) leads its CritPt benchmark for research-level physics reasoning with a 30.0% score, ahead of GPT-5.4 at 20.0% and Gemini 3.1 Pro Preview at 17.7%. That is a notable jump on one of the tougher frontier-science evals around, even if the benchmark still shows models are far from reliably solving full research-scale physics problems.
CritPt is the kind of benchmark that actually matters because it tests unpublished, guess-resistant research tasks instead of polished textbook problems. GPT-5.4 pro's result looks less like "AI can do physics now" and more like evidence that frontier reasoning models are finally moving the needle on hard science benchmarks.
- Artificial Analysis describes CritPt as 71 composite challenges created by 50+ active physics researchers across 11 subfields, which gives the result more weight than typical exam-style evals
- The leaderboard gap is real: 30.0% for GPT-5.4 pro versus 20.0% for GPT-5.4 and 17.7% for Gemini 3.1 Pro Preview
- The benchmark page still says leading models remain far from reliably solving full research-scale challenges, so the headline is progress under harsh conditions, not scientific autonomy
- For AI developers, this is a signal that premium reasoning configurations are starting to separate themselves on specialized expert tasks, not just generic coding and math leaderboards
DISCOVERED: 2026-03-07
PUBLISHED: 2026-03-07
AUTHOR: kaggleqrdl