YT · YOUTUBE // 1d ago · BENCHMARK RESULT

ARC-AGI-3 shows frontier models struggle with novelty

ARC-AGI-3 is an interactive benchmark for novelty, sparse feedback, and continual learning in unfamiliar environments. ARC Prize’s May 1, 2026 analysis found GPT-5.5 scored 0.43% and Opus 4.7 scored 0.18% on the semi-private dataset, with replay-based runs making the failure modes visible.
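The video does not walk through ARC-AGI-3's actual interface, so as a rough mental model only, here is a minimal Python sketch of what "interactive benchmark with sparse feedback" means in practice. Every name in it (Env, step, run_episode) is hypothetical; none of this comes from the ARC-AGI-3 API.

// SKETCH: INTERACTIVE, SPARSE-FEEDBACK LOOP (HYPOTHETICAL)

from dataclasses import dataclass

@dataclass
class Env:
    """Toy stand-in for one unfamiliar game environment."""
    goal: int
    state: int = 0

    def step(self, action: int) -> tuple[int, float, bool]:
        self.state += action
        done = self.state == self.goal
        # Sparse feedback: a reward only on success, nothing in between.
        return self.state, (1.0 if done else 0.0), done

def run_episode(env: Env, policy, max_steps: int = 50) -> float:
    obs, total = env.state, 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

# A policy that keeps applying a pattern that worked elsewhere
# (always move "up") never reaches a goal in the other direction.
print(run_episode(Env(goal=-3), policy=lambda obs: 1))  # 0.0

Unlike a static puzzle set, the score here depends on the whole trajectory of actions, which is why the replay-based runs are informative: they show where an agent's behavior went wrong, not just that it did.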

// ANALYSIS

Hot take: this is less a “models are bad” story than a reminder that novelty is still the hard part, and current agents often mistake short-term pattern matching for understanding.

  • The scores are genuinely weak for frontier systems: GPT-5.5 at 0.43% and Opus 4.7 at 0.18%.
  • The interesting signal is qualitative, not just quantitative: the analysis highlights false world models, bad abstraction, and failure to carry learning forward.
  • ARC-AGI-3 is positioned as an agent benchmark, not a static puzzle set, so it is closer to real-world tool use and adaptation than many traditional evals.
  • The main takeaway for agent builders is that “solving a level” is not the same as learning the game; recovery from wrong assumptions remains a major gap (a toy sketch of that recovery step follows this list).
  • The video framing as a reality check is fair: progress in agents is real, but robust open-ended adaptation is still far from solved.
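To make "recovery from wrong assumptions" concrete, here is a toy hypothesis-elimination loop. It assumes nothing about how ARC Prize's agents or the replays actually work; predict, observe, and the two candidate rules are invented for illustration.

// SKETCH: DROPPING A FALSIFIED WORLD MODEL (HYPOTHETICAL)

def predict(hypothesis: str, state: int) -> int:
    # Two candidate rules an agent might believe about the game.
    return state + 1 if hypothesis == "increment" else state * 2

def observe(state: int) -> int:
    # Ground truth the agent does not know: the game doubles state.
    return state * 2

hypotheses = ["increment", "double"]
state = 3
for _ in range(3):
    nxt = observe(state)
    # The updating step: discard every hypothesis the observation falsifies.
    hypotheses = [h for h in hypotheses if predict(h, state) == nxt]
    state = nxt

print(hypotheses)  # ['double'] -- the false model was dropped, not kept

The single important line is the filter: the “false world models” failure mode is, mechanically, an agent that keeps acting on a hypothesis its own observations have already falsified.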
// TAGS
arc-agi-3 · llm · benchmark · evaluation · novelty · agent · frontier-models · gpt-5.5 · opus-4.7 · general-intelligence

DISCOVERED

2026-05-02 (1d ago)

PUBLISHED

2026-05-02 (1d ago)

RELEVANCE

8/10

AUTHOR

WorldofAI