OPEN_SOURCE ↗
REDDIT · 18h ago · BENCHMARK RESULT
Raw log search makes ARC-AGI-3 tractable
This blog post argues that harness design makes a large difference on ARC-AGI-3. Instead of relying on a one-shot agent, the authors save full game logs, including actions, board states, and scores, then let an LLM search over those logs with tools.
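The post's actual harness isn't reproduced here, but the core idea — record every action, board state, and score, then expose search over that history as a tool — can be sketched in a few lines. All class and method names below are hypothetical illustrations, not the authors' API:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One logged transition: the action taken, resulting board, and score."""
    action: str
    board: tuple
    score: int


@dataclass
class GameLog:
    """Full trajectory for one episode; an LLM would query it via tools."""
    steps: list = field(default_factory=list)

    def record(self, action, board, score):
        # Store an immutable snapshot of the board alongside action and score.
        self.steps.append(Step(action, tuple(board), score))

    def find_steps(self, min_score=None, action=None):
        """Tool 1 (hypothetical): retrieval-like filter over logged states."""
        out = self.steps
        if min_score is not None:
            out = [s for s in out if s.score >= min_score]
        if action is not None:
            out = [s for s in out if s.action == action]
        return out

    def score_deltas(self):
        """Tool 2 (hypothetical): score changes, for spotting which actions helped."""
        scores = [s.score for s in self.steps]
        return [b - a for a, b in zip(scores, scores[1:])]


# Minimal usage: log three moves, then search for high-scoring "right" moves.
log = GameLog()
log.record("up", [0, 1], 0)
log.record("right", [1, 1], 3)
log.record("up", [1, 2], 3)
hits = log.find_steps(min_score=3, action="right")
```

The point of this shape is that the agent never has to hold thousands of states in context; it issues targeted queries against the log instead.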
// ANALYSIS
Hot take: ARC-AGI-3 looks much less like a pure “model intelligence” test and much more like a test of whether your agent can do structured search over state history.
- The main claim is that harnessing is not a minor tweak here: log search materially changes outcomes.
- The authors report that frontier LLMs fail to progress far with a naive agent loop, but perform far better when given raw logs and retrieval-like search.
- The post pushes back on the idea that added tooling always hurts benchmark validity; for this benchmark, it appears to recover a lot of usable performance.
- The strongest result is the contrast between tool-light agents and agents that can inspect many thousands of logged states.
- The Python example suggests that once the system recognizes a classic substructure, it can switch from heuristic search to exact algorithmic solving.
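The post's Python example isn't reproduced here; a hypothetical sketch of the heuristic-to-exact switch in the last point, assuming the recognized substructure is a grid maze where BFS is an exact, optimal solver:

```python
from collections import deque


def looks_like_maze(grid):
    """Hypothetical detector: every cell is either a wall (1) or floor (0)."""
    return all(cell in (0, 1) for row in grid for cell in row)


def bfs_shortest_path(grid, start, goal):
    """Exact solver: once the maze structure is recognized, BFS finds an
    optimal path instead of relying on heuristic search."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # no path exists


# Usage: detect the substructure, then hand off to the exact solver.
grid = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
path = bfs_shortest_path(grid, (0, 0), (2, 2)) if looks_like_maze(grid) else None
```

The design choice being illustrated: recognition is cheap and approximate, but once it fires, the agent gets exactness guarantees that no amount of heuristic rollout provides.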
// TAGS
arc-agi-3 · llm-agents · harness · tool-use · search · game-logs · benchmark · reasoning
DISCOVERED
18h ago
2026-05-02
PUBLISHED
19h ago
2026-05-02
RELEVANCE
8/10
AUTHOR
ClarityInMadness