AI tops 40% on Humanity's Last Exam
Humanity's Last Exam (HLE), a 2,500-question benchmark co-developed by the Center for AI Safety and Scale AI and published in Nature, now sees top models scoring above 40%, up from single digits at launch in early 2025. Domain experts still average ~90% in their own fields, so the gap remains wide.
The jump from sub-10% to 40%+ in roughly one year is remarkable, but the leading model still misses more than half the questions, which suggests frontier AI still lacks deep expert-level reasoning.
- HLE was built by nearly 1,000 researchers who deliberately excluded any question current AI could answer, making it a genuine frontier benchmark by design
- At launch, GPT-4o scored 2.7% and o1 scored 8%; by March 2026, Gemini 3.1 Pro Preview leads at 44.7%, with Claude Opus 4.6 Thinking at 34.4%
- Expert humans in their respective fields average ~90%, a gap of roughly 45 points to the leading model that persists despite rapid AI improvement
- Now published in Nature, HLE has become the canonical academic benchmark for measuring frontier model progress
- The rapid improvement rate raises an uncomfortable question: how long before HLE itself needs to be replaced?
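The gap figures above are simple subtraction against the expert baseline. A minimal sketch of that arithmetic, using only the scores quoted in this summary: the ~90% expert average is treated as exactly 90.0 for illustration, and the names `EXPERT_BASELINE`, `scores`, and `gap_to_experts` are made up for this example.

```python
# Scores (percent correct) as quoted in the summary above.
# Assumption: the "~90%" expert average is taken as exactly 90.0.
EXPERT_BASELINE = 90.0

scores = {
    "GPT-4o (launch)": 2.7,
    "o1 (launch)": 8.0,
    "Claude Opus 4.6 Thinking (March 2026)": 34.4,
    "Gemini 3.1 Pro Preview (March 2026)": 44.7,
}


def gap_to_experts(score, baseline=EXPERT_BASELINE):
    """Return the percentage-point gap between a score and the expert baseline."""
    return round(baseline - score, 1)


# Gap for every model mentioned in the summary.
gaps = {name: gap_to_experts(score) for name, score in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below the expert baseline")
```

Because the expert figure is approximate, these gaps are approximate too; the point is only that the leading model sits roughly 45 points below domain experts.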
DISCOVERED: 2026-03-16
PUBLISHED: 2026-03-16
AUTHOR: PixeledPathogen