ARC-AGI-3 benchmark retools scoring around efficiency

// 109d ago RESEARCH PAPER

ARC Prize's ARC-AGI-3 benchmark scores agents with RHAE, a per-level metric measured against the second-best first-run human, squared to reward efficiency, and capped at 100% (https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf). The official ARC-AGI-3 page frames it as an interactive reasoning benchmark, and because the official evaluation is harness-free, it measures first-run behavior more than the tool-heavy agent workflows driving recent coding-assistant hype (https://arcprize.org/arc-agi/3).

// ANALYSIS

ARC-AGI-3 is a better benchmark for agent behavior than ARC-AGI-1/2, but it is also a narrower one. It measures first-pass interaction efficiency under a tightly controlled protocol, so comparing it head-to-head with older static ARC scores is a little apples-to-oranges.

– Because per-level credit is capped at 1.0, superhuman efficiency on some levels cannot offset a weak level; 100% only happens when every weighted level reaches the human bar.
– The human baseline comes from 10 first-time testers per environment and uses the second-best run, so the bar is intentionally set at strong novice performance rather than average-person comfort.
– Later levels are weighted more heavily, so the benchmark rewards end-to-end mastery instead of cheap early progress.
– The paper splits official vs community leaderboards: official scores exclude hand-tuned harnesses, while the community board is for harness research and self-reported results. That keeps comparisons cleaner, but it also means the headline number is intentionally not the best tool-augmented setup.
– The docs say only environment interactions count; tool calls, retries, and internal reasoning are excluded, which keeps the metric focused on what the agent does in the game rather than hidden scratch work (https://docs.arcprize.org/methodology).
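
The scoring rules in the bullets above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official RHAE implementation: it assumes efficiency is the ratio of the human baseline's action count to the agent's action count, that an unsolved level scores 0, and that level weights grow linearly with level index. The function names and the weighting scheme are hypothetical; the technical report is the authority on the exact formula.

```python
from typing import Optional, Sequence


def human_baseline(first_run_action_counts: Sequence[int]) -> int:
    """Second-lowest action count among first-run human testers.

    Assumption: "second-best run" means the second-fewest actions.
    """
    if len(first_run_action_counts) < 2:
        raise ValueError("need at least two human runs")
    return sorted(first_run_action_counts)[1]


def level_score(agent_actions: Optional[int], baseline_actions: int) -> float:
    """Squared efficiency ratio, capped at 1.0.

    Assumptions: the ratio is baseline / agent actions, and an unsolved
    level (agent_actions is None) scores 0.
    """
    if agent_actions is None or agent_actions <= 0:
        return 0.0
    ratio = baseline_actions / agent_actions
    # Capping before or after squaring gives the same result for ratio >= 0.
    return min(1.0, ratio ** 2)


def environment_score(
    agent_actions: Sequence[Optional[int]],
    baselines: Sequence[int],
) -> float:
    """Weighted mean of per-level scores, with later levels weighted more.

    Assumption: linear weights 1, 2, 3, ... are a placeholder; the source
    only says later levels count more.
    """
    if len(agent_actions) != len(baselines):
        raise ValueError("one agent result per level is required")
    weights = range(1, len(baselines) + 1)
    weighted = sum(
        w * level_score(a, b)
        for w, a, b in zip(weights, agent_actions, baselines)
    )
    return weighted / sum(weights)
```

Under these assumptions, an agent that needs 90 actions on a level where the baseline is 45 earns 0.25 for that level rather than 0.5, which is the effect of squaring. An agent that finishes one level in far fewer actions than the baseline still earns only 1.0 there, so it cannot make up for a slow or unsolved level elsewhere.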

// TAGS

arc-agi-3 benchmark research reasoning agent

DISCOVERED

109d ago

2026-03-25

PUBLISHED

109d ago

2026-03-25

RELEVANCE

9/10

AUTHOR

FateOfMuffins
