OPEN_SOURCE
REDDIT · 4h ago · BENCHMARK RESULT
ARC-AGI-3 sets elite human reasoning baseline
The ARC Prize Foundation has updated its reasoning benchmark to anchor the 100% human baseline to "near-elite" action efficiency. The new RHAE scoring system rigorously penalizes inefficient AI solutions, leaving current frontier models with scores below 1%.
// ANALYSIS
The shift from binary success to action efficiency represents a brutal raising of the bar for reasoning models.
- The 100% baseline is now defined by the second-best first-run human performance, effectively requiring AI to match top-tier human intuition.
- RHAE (Relative Human Action Efficiency) applies a squared penalty for excessive steps, causing scores to drop precipitously for "brute-force" or inefficient solvers.
- Frontier models like Gemini 3.1 and GPT-5 currently struggle to break 1%, highlighting the persistent "fluid intelligence" gap in LLMs.
- While community "harnesses" have reached 36% by providing better context, official "text-in, text-out" evaluation remains the gold standard for pure reasoning.
- This update signals a pivot in the AGI debate: it's no longer just about solving the puzzle, but solving it with the same minimal priors humans use.
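The post does not give the exact RHAE formula, but the described behavior, scoring relative to a human action baseline with a squared penalty for excess steps, can be sketched as follows. The function name `rhae` and the capping and failure-handling choices are assumptions for illustration, not the Foundation's published scoring code.

```python
# Illustrative sketch only: the exact RHAE formula is not public in this post.
# Assumption: RHAE compares an agent's action count to the human baseline and
# squares the efficiency ratio, so excess actions are penalized quadratically.

def rhae(human_actions: int, agent_actions: int, solved: bool = True) -> float:
    """Hypothetical Relative Human Action Efficiency score in [0, 1]."""
    if not solved or agent_actions <= 0:
        return 0.0
    # Efficiency ratio capped at 1: matching or beating the baseline scores 100%.
    ratio = min(1.0, human_actions / agent_actions)
    # Squared penalty: twice the human's actions -> a quarter of the score.
    return ratio ** 2

print(rhae(human_actions=10, agent_actions=20))  # 0.25
```

Under this sketch the quadratic term explains the sub-1% frontier scores in the post: a brute-force solver taking roughly 10x the human baseline's actions would score about (1/10)^2 = 1%.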
// TAGS
arc-agi-3 · arc-agi · reasoning · benchmark · llm · research
DISCOVERED
4h ago
2026-04-15
PUBLISHED
5h ago
2026-04-14
RELEVANCE
9/10
AUTHOR
exordin26