OPEN_SOURCE
REDDIT // 27d ago · BENCHMARK RESULT
AI tops 40% on Humanity's Last Exam
Humanity's Last Exam (HLE), a 2,500-question benchmark co-developed by the Center for AI Safety and Scale AI and published in Nature, now sees top models scoring ~40%—up from single digits at launch in early 2025. Expert humans in their domains still average ~90%, making the gap stark.
// ANALYSIS
The jump from sub-10% to 40%+ in roughly one year is remarkable, but the roughly 50-point gap to domain experts shows frontier AI still lacks deep expert-level reasoning.
- HLE was built by nearly 1,000 researchers who deliberately excluded any question then-current AI could answer, making it a genuine frontier benchmark by design
- At launch: GPT-4o at 2.7%, o1 at 8%; by March 2026, Gemini 3.1 Pro Preview leads at 44.7%, with Claude Opus 4.6 Thinking at 34.4%
- Expert humans in their respective fields average ~90%—a 45+ point gap that persists despite rapid AI improvement
- Now published in Nature, HLE has become the canonical academic benchmark for measuring frontier model progress
- The rapid improvement rate raises an uncomfortable question: how long before HLE itself needs to be replaced?
// TAGS
benchmark · llm · reasoning · research · safety
DISCOVERED
27d ago
2026-03-16
PUBLISHED
27d ago
2026-03-16
RELEVANCE
8/10
AUTHOR
PixeledPathogen