OPEN_SOURCE
REDDIT · 1d ago · BENCHMARK RESULT
ARC-AGI-3 Holds GPT-5.5, Opus 4.7 Below 1%
ARC Prize’s latest ARC-AGI-3 analysis shows GPT-5.5 High at 0.43% and Claude Opus 4.7 at 0.18% on the benchmark. The results underline how far frontier models still are from robust novel-environment reasoning, even when they look strong on more familiar evals.
// ANALYSIS
This is a useful reality check: the frontier is still very good at pattern completion, but brittle when the task demands building a world model from scratch and adapting over time.
- Both scores are effectively near-zero in human terms, so the gap is less about who “wins” and more about how far the whole class of models is from the target
- ARC-AGI-3’s replay-based analysis matters as much as the score because it exposes failure modes, not just leaderboard position
- GPT-5.5 appears to explore more hypotheses, while Opus 4.7 seems better at short-horizon mechanics, but neither reliably converts local discoveries into durable strategy
- For agent builders, this reinforces that tool access and harnesses can mask weak generalization; true autonomy still needs better planning, memory, and adaptation
// TAGS
arc-agi-3 · gpt-5.5 · claude-opus-4.7 · benchmark · evaluation · reasoning · llm · research
DISCOVERED
1d ago
2026-05-02
PUBLISHED
1d ago
2026-05-02
RELEVANCE
10/10
AUTHOR
skazerb