OPEN_SOURCE
REDDIT // 27d ago · BENCHMARK RESULT
AI tops 40% on Humanity's Last Exam
Humanity's Last Exam (HLE), a 2,500-question benchmark co-developed by the Center for AI Safety and Scale AI and published in Nature, now sees top models scoring ~40%—up from single digits at launch in early 2025. Expert humans in their domains still average ~90%, making the gap stark.
// ANALYSIS
The jump from sub-10% to 40%+ in roughly one year is remarkable, but the roughly 50-point gap to domain experts shows frontier AI still lacks deep expert-level reasoning.
- HLE was built by nearly 1,000 researchers who deliberately excluded any question then-current AI could answer, making it a genuine frontier benchmark by design
- At launch: GPT-4o at 2.7%, o1 at 8%; by March 2026, Gemini 3.1 Pro Preview leads at 44.7%, with Claude Opus 4.6 Thinking at 34.4%
- Expert humans in their respective fields average ~90%—a 45+ point gap that persists despite rapid AI improvement
- Now published in Nature, HLE has become the canonical academic benchmark for measuring frontier model progress
- The rapid improvement rate raises an uncomfortable question: how long before HLE itself needs to be replaced?
// TAGS
benchmark · llm · reasoning · research · safety
DISCOVERED
27d ago
2026-03-16
PUBLISHED
27d ago
2026-03-16
RELEVANCE
8/10
AUTHOR
PixeledPathogen