ARC-AGI-3 Holds GPT-5.5, Opus 4.7 Below 1%
ARC Prize’s latest ARC-AGI-3 analysis shows GPT-5.5 High at 0.43% and Claude Opus 4.7 at 0.18% on the benchmark. The results underline how far frontier models still are from robust novel-environment reasoning, even when they look strong on more familiar evals.
This is a useful reality check: frontier models remain very good at pattern completion but are brittle when a task demands building a world model from scratch and adapting over time.
- Both scores are effectively near-zero in human terms, so the gap is less about who “wins” and more about how far the whole class of models is from the target
- ARC-AGI-3’s replay-based analysis matters as much as the score because it exposes failure modes, not just leaderboard position
- GPT-5.5 appears to explore more hypotheses, while Opus 4.7 seems better at short-horizon mechanics, but neither reliably converts local discoveries into durable strategy
- For agent builders, this reinforces that tool access and harnesses can mask weak generalization; true autonomy still needs better planning, memory, and adaptation
DISCOVERED
2026-05-02
PUBLISHED
2026-05-02
AUTHOR
skazerb