OPEN_SOURCE ↗
REDDIT · 18h ago · BENCHMARK RESULT
Raw log search makes ARC-AGI-3 tractable
This blog post argues that harness design makes a large difference on ARC-AGI-3. Instead of relying on a one-shot agent, the authors save full game logs, including actions, board states, and scores, then let an LLM search over those logs with tools.
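The post's actual harness isn't reproduced here, but the core idea — record every action, board state, and score, then expose search over that history as a tool — can be sketched in a few lines. All class and method names below are hypothetical illustrations, not the authors' API:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One logged transition: the action taken, resulting board, and score."""
    action: str
    board: tuple
    score: int


@dataclass
class GameLog:
    """Full trajectory for one episode; an LLM would query it via tools."""
    steps: list = field(default_factory=list)

    def record(self, action, board, score):
        # Store an immutable snapshot of the board alongside action and score.
        self.steps.append(Step(action, tuple(board), score))

    def find_steps(self, min_score=None, action=None):
        """Tool 1 (hypothetical): retrieval-like filter over logged states."""
        out = self.steps
        if min_score is not None:
            out = [s for s in out if s.score >= min_score]
        if action is not None:
            out = [s for s in out if s.action == action]
        return out

    def score_deltas(self):
        """Tool 2 (hypothetical): score changes, for spotting which actions helped."""
        scores = [s.score for s in self.steps]
        return [b - a for a, b in zip(scores, scores[1:])]


# Minimal usage: log three moves, then search for high-scoring "right" moves.
log = GameLog()
log.record("up", [0, 1], 0)
log.record("right", [1, 1], 3)
log.record("up", [1, 2], 3)
hits = log.find_steps(min_score=3, action="right")
```

The point of this shape is that the agent never has to hold thousands of states in context; it issues targeted queries against the log instead.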
// ANALYSIS
Hot take: ARC-AGI-3 looks much less like a pure “model intelligence” test and much more like a test of whether your agent can do structured search over state history.
- The main claim is that harnessing is not a minor tweak here: log search materially changes outcomes.
- The authors report that frontier LLMs fail to progress far with a naive agent loop, but perform far better when given raw logs and retrieval-like search.
- The post pushes back on the idea that added tooling always hurts benchmark validity; for this benchmark, it appears to recover a lot of usable performance.
- The strongest result is the contrast between tool-light agents and agents that can inspect many thousands of logged states.
- The Python example suggests that once the system recognizes a classic substructure, it can switch from heuristic search to exact algorithmic solving.
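The post's Python example isn't reproduced here; a hypothetical sketch of the heuristic-to-exact switch in the last point, assuming the recognized substructure is a grid maze where BFS is an exact, optimal solver:

```python
from collections import deque


def looks_like_maze(grid):
    """Hypothetical detector: every cell is either a wall (1) or floor (0)."""
    return all(cell in (0, 1) for row in grid for cell in row)


def bfs_shortest_path(grid, start, goal):
    """Exact solver: once the maze structure is recognized, BFS finds an
    optimal path instead of relying on heuristic search."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # no path exists


# Usage: detect the substructure, then hand off to the exact solver.
grid = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
path = bfs_shortest_path(grid, (0, 0), (2, 2)) if looks_like_maze(grid) else None
```

The design choice being illustrated: recognition is cheap and approximate, but once it fires, the agent gets exactness guarantees that no amount of heuristic rollout provides.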
// TAGS
arc-agi-3 · llm-agents · harness · tool-use · search · game-logs · benchmark · reasoning
DISCOVERED
18h ago
2026-05-02
PUBLISHED
19h ago
2026-05-02
RELEVANCE
8/10
AUTHOR
ClarityInMadness