OPEN_SOURCE
REDDIT · 2d ago · BENCHMARK RESULT
Anthropic skips ARC-AGI-3 for Claude Mythos
Anthropic's "Mythos" model dominates coding and security benchmarks but remains silent on the interactive ARC-AGI-3 test. The omission fuels speculation that even the most advanced frontier models are hitting a wall in general reasoning.
// ANALYSIS
The "Mythos" omission confirms that ARC-AGI-3 is the new "vibes check" for AGI, and right now, nobody is passing.
- While Mythos sets records on SWE-bench (93.9%) and USAMO (97.6%), these remain "pattern-matching" domains compared to ARC's interactive reasoning.
- The 0.37% state-of-the-art score (Gemini 3.1 Pro) suggests that current LLM architectures struggle with novel rule discovery in game-like environments (see the sketch after this list).
- Anthropic's silence likely stems from marketing: even a "good" 1% score looks like failure to a non-technical audience accustomed to 90%+ benchmark numbers.
- Project Glasswing's restricted access makes independent verification impossible, allowing Anthropic to curate the model's performance narrative.
- The extreme token cost ($125 per million output tokens) suggests Mythos is optimized for high-value reasoning, which makes a potential failure on basic ARC tasks even more telling.
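
To make the rule-discovery point in the second bullet concrete, here is a minimal, hypothetical sketch in Python. Nothing below is the real ARC-AGI-3 harness: the environment, its hidden rule, the action space, and the scoring loop are all invented for illustration. What it demonstrates is structural: an interactive benchmark hands the policy no training set, only feedback from its own actions, so a static pattern-matcher plateaus while an agent that exploits interaction history can recover the rule.

```python
import random

class HiddenRuleEnv:
    """Toy environment with an undisclosed transition rule (invented for illustration)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.randrange(10)

    def step(self, action):
        # Secret rule the agent must discover: the rewarded action is the state's parity.
        reward = 1 if action == (self.state % 2) else 0
        self.state = self.rng.randrange(10)
        return self.state, reward

def evaluate(policy, episodes=200):
    """Score a policy purely through interaction: no examples are given up front."""
    env = HiddenRuleEnv()
    state, history, total = env.state, [], 0
    for _ in range(episodes):
        action = policy(state, history)
        next_state, reward = env.step(action)
        history.append((state, action, reward))  # the only "training data" available
        state, total = next_state, total + reward
    return total / episodes

def learner(state, history):
    """Infer the hidden rule online from past (state, action, reward) tuples."""
    for s, a, r in reversed(history):
        if s % 2 == state % 2:
            return a if r else 1 - a  # repeat what worked, flip what failed
    return 0  # no evidence yet for this parity: explore

print(evaluate(lambda s, h: 0))   # feedback-blind guesser: ~0.5
print(evaluate(learner))          # online rule discovery: approaches 1.0
```

The cost bullet compounds this. At the quoted $125 per million output tokens, a modest interactive sweep (say, 1,000 episodes at ~10k output tokens each, figures invented for illustration) burns roughly 10M tokens, about $1,250 per run, which is another plausible reason labs hesitate to publish exploratory ARC-AGI-3 numbers.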
// TAGS
anthropic · claude-mythos · llm · benchmark · reasoning · safety
DISCOVERED
2026-04-10
PUBLISHED
2026-04-10
RELEVANCE
8/10
AUTHOR
Neurogence