GPT-5.4 pro tops CritPt at 30%
Artificial Analysis says GPT-5.4 pro (xhigh) leads its CritPt benchmark for research-level physics reasoning with a 30.0% score, ahead of GPT-5.4 at 20.0% and Gemini 3.1 Pro Preview at 17.7%. That is a notable jump on one of the tougher frontier-science evals around, even if the benchmark still shows models are far from reliably solving full research-scale physics problems.
CritPt is the kind of benchmark that actually matters because it tests unpublished, guess-resistant research tasks instead of polished textbook problems. GPT-5.4 pro's result looks less like "AI can do physics now" and more like evidence that frontier reasoning models are finally moving the needle on hard science benchmarks.
- Artificial Analysis describes CritPt as 71 composite challenges created by 50+ active physics researchers across 11 subfields, which gives the result more weight than typical exam-style evals
- The leaderboard gap is real: 30.0% for GPT-5.4 pro versus 20.0% for GPT-5.4 and 17.7% for Gemini 3.1 Pro Preview
- The benchmark page still says leading models remain far from reliably solving full research-scale challenges, so the headline is progress under harsh conditions, not scientific autonomy
- For AI developers, this is a signal that premium reasoning configurations are starting to separate themselves on specialized expert tasks, not just generic coding and math leaderboards
DISCOVERED: 2026-03-07
PUBLISHED: 2026-03-07
AUTHOR: kaggleqrdl