ARC-AGI-3 benchmark retools scoring around efficiency
REDDIT · 17d ago · RESEARCH PAPER

ARC Prize's ARC-AGI-3 benchmark scores agents with RHAE, a per-level metric anchored to the second-best first-run human result, squared to reward efficiency, and capped at 100% (https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf). The official ARC-AGI-3 page frames it as an interactive reasoning benchmark, and because the official evaluation is harness-free, the score reflects first-run behavior more than the tool-heavy agent workflows driving recent coding-assistant hype (https://arcprize.org/arc-agi/3).

// ANALYSIS

ARC-AGI-3 is a better benchmark for agent behavior than ARC-AGI-1/2, but it is also a narrower one. It measures first-pass interaction efficiency under a tightly controlled protocol, so head-to-head comparisons with older static ARC scores are somewhat apples-to-oranges.

  • Because per-level credit is capped at 1.0 (100%), superhuman efficiency on some levels cannot offset a weak level; an overall 100% requires every weighted level to reach the human bar.
  • The human baseline comes from 10 first-time testers per environment and uses the second-best run, so the bar is intentionally set at strong novice performance rather than average-person comfort.
  • Later levels are weighted more heavily, so the benchmark rewards end-to-end mastery instead of cheap early progress.
  • The paper splits official and community leaderboards: official scores exclude hand-tuned harnesses, while the community board is for harness research and self-reported results. That keeps comparisons cleaner, but it also means the headline number intentionally does not reflect the best tool-augmented setup.
  • The docs say only environment interactions count; tool calls, retries, and internal reasoning are excluded, which keeps the metric focused on what the agent does in the game rather than hidden scratch work (https://docs.arcprize.org/methodology).
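
The scoring rules in the bullets above can be sketched in a few lines of Python. This is an illustrative reading of the summary, not the official scoring code: it assumes the per-level ratio is the human baseline's action count divided by the agent's action count, and the function names and the linear level weights (1, 2, 3) are hypothetical placeholders chosen only to show later levels counting more.

```python
from typing import Sequence


def second_best(human_action_counts: Sequence[int]) -> int:
    """Baseline for a level: the second-fewest actions among first-run testers."""
    if len(human_action_counts) < 2:
        raise ValueError("need at least two human runs to pick the second best")
    return sorted(human_action_counts)[1]


def level_score(human_actions: int, agent_actions: int) -> float:
    """Squared efficiency ratio, capped at 1.0.

    Assumption: ratio = human baseline actions / agent actions.
    """
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    ratio = human_actions / agent_actions
    return min(1.0, ratio ** 2)


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-level scores (weights here are hypothetical)."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical three-level environment.
baselines = [20, 20, 30]   # second-best human action counts per level
agent = [10, 40, 30]       # agent action counts per level
scores = [level_score(h, a) for h, a in zip(baselines, agent)]
# scores == [1.0, 0.25, 1.0]: level 1 is capped even though the agent
# used half the human's actions, so it cannot offset the weak level 2.
total = weighted_score(scores, weights=[1, 2, 3])  # 0.75
```

Under these assumptions, squaring makes inefficiency expensive: taking twice as many actions as the baseline yields 0.25 for that level, not 0.5.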
// TAGS
arc-agi-3 · benchmark · research · reasoning · agent

DISCOVERED

17d ago

2026-03-25

PUBLISHED

17d ago

2026-03-25

RELEVANCE

9/10

AUTHOR

FateOfMuffins