REDDIT // 1d ago // BENCHMARK RESULT

ARC-AGI-3 Holds GPT-5.5, Opus 4.7 Below 1%

ARC Prize’s latest ARC-AGI-3 analysis shows GPT-5.5 High at 0.43% and Claude Opus 4.7 at 0.18% on the benchmark. The results underline how far frontier models still are from robust novel-environment reasoning, even when they look strong on more familiar evals.

// ANALYSIS

This is a useful reality check: the frontier is still very good at pattern completion, but brittle when the task demands building a world model from scratch and adapting over time.

– Both scores are near zero in human terms, so the gap is less about who “wins” and more about how far the whole class of models is from the target
– ARC-AGI-3’s replay-based analysis matters as much as the score because it exposes failure modes, not just leaderboard position
– GPT-5.5 appears to explore more hypotheses, while Opus 4.7 seems better at short-horizon mechanics, but neither reliably converts local discoveries into a durable strategy
– For agent builders, this reinforces that tool access and harnesses can mask weak generalization; true autonomy still needs better planning, memory, and adaptation

// TAGS

arc-agi-3 · gpt-5.5 · claude-opus-4.7 · benchmark · evaluation · reasoning · llm · research

DISCOVERED

1d ago

2026-05-02

PUBLISHED

1d ago

2026-05-02

RELEVANCE

10/10

AUTHOR

skazerb