REDDIT // 27d ago // BENCHMARK RESULT

AI tops 40% on Humanity's Last Exam

Humanity's Last Exam (HLE), a 2,500-question benchmark co-developed by the Center for AI Safety and Scale AI and published in Nature, now sees top models scoring ~40%, up from single digits at the benchmark's launch in early 2025. Domain experts still average ~90% on questions in their own fields, leaving a stark gap.

// ANALYSIS

The jump from sub-10% to 40%+ in roughly one year is remarkable, but the roughly 50-point gap to expert human performance shows frontier AI still lacks deep expert-level reasoning.

  • HLE was built by nearly 1,000 researchers who deliberately excluded any question current AI could answer, making it a genuine frontier benchmark by design
  • At launch: GPT-4o at 2.7%, o1 at 8%; by March 2026, Gemini 3.1 Pro Preview leads at 44.7%, Claude Opus 4.6 Thinking at 34.4%
  • Expert humans in their respective fields average ~90%—a 45+ point gap that persists despite rapid AI improvement
  • Now published in Nature, HLE has become the canonical academic benchmark for measuring frontier model progress
  • The rapid improvement rate raises an uncomfortable question: how long before HLE itself needs to be replaced?
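The scores quoted in the bullets above can be tallied with a quick script. The numbers come straight from the article; the script itself is only an illustrative sketch of the model-to-expert gap, not anything from the original source:

```python
# Scores quoted in the article (percent correct on HLE).
scores = {
    "GPT-4o (launch, early 2025)": 2.7,
    "o1 (launch, early 2025)": 8.0,
    "Claude Opus 4.6 Thinking (Mar 2026)": 34.4,
    "Gemini 3.1 Pro Preview (Mar 2026)": 44.7,
}

# Approximate domain-expert average cited in the article.
EXPERT_HUMAN = 90.0

for model, score in scores.items():
    gap = EXPERT_HUMAN - score  # points below expert-human average
    print(f"{model}: {score:.1f}% (gap to experts: {gap:.1f} points)")
```

Even for the current leader, the computed gap stays above 45 points, which is the "45+ point gap" the analysis refers to.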
// TAGS
benchmark · llm · reasoning · research · safety

DISCOVERED

2026-03-16 (27d ago)

PUBLISHED

2026-03-16 (27d ago)

RELEVANCE

8 / 10

AUTHOR

PixeledPathogen