OPEN_SOURCE
REDDIT · 2d ago · BENCHMARK RESULT
Anthropic skips ARC-AGI-3 for Claude Mythos
Anthropic's "Mythos" model dominates coding and security benchmarks but remains silent on the interactive ARC-AGI-3 test. The omission fuels speculation that even the most advanced frontier models are hitting a wall in general reasoning.
// ANALYSIS
The "Mythos" omission confirms that ARC-AGI-3 is the new "vibes check" for AGI, and right now, nobody is passing.
- While Mythos sets records on SWE-bench (93.9%) and USAMO (97.6%), these remain "pattern-matching" domains compared to ARC's interactive reasoning.
- The 0.37% state-of-the-art score (Gemini 3.1 Pro) suggests that current LLM architectures struggle with novel rule discovery in game-like environments (see the sketch after this list).
- Anthropic's silence likely stems from marketing: even a "good" 1% score looks like failure to a non-technical audience accustomed to 90%+ benchmark numbers.
- Project Glasswing's restricted access makes independent verification impossible, allowing Anthropic to curate the model's performance narrative.
- The extreme token cost ($125 per million output tokens) suggests Mythos is optimized for high-value reasoning, which makes a potential failure on basic ARC tasks even more telling.
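
To make the rule-discovery point in the second bullet concrete, here is a minimal, hypothetical sketch in Python. Nothing below is the real ARC-AGI-3 harness: the environment, its hidden rule, the action space, and the scoring loop are all invented for illustration. What it demonstrates is structural: an interactive benchmark hands the policy no training set, only feedback from its own actions, so a static pattern-matcher plateaus while an agent that exploits interaction history can recover the rule.

```python
import random

class HiddenRuleEnv:
    """Toy environment with an undisclosed transition rule (invented for illustration)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.randrange(10)

    def step(self, action):
        # Secret rule the agent must discover: the rewarded action is the state's parity.
        reward = 1 if action == (self.state % 2) else 0
        self.state = self.rng.randrange(10)
        return self.state, reward

def evaluate(policy, episodes=200):
    """Score a policy purely through interaction: no examples are given up front."""
    env = HiddenRuleEnv()
    state, history, total = env.state, [], 0
    for _ in range(episodes):
        action = policy(state, history)
        next_state, reward = env.step(action)
        history.append((state, action, reward))  # the only "training data" available
        state, total = next_state, total + reward
    return total / episodes

def learner(state, history):
    """Infer the hidden rule online from past (state, action, reward) tuples."""
    for s, a, r in reversed(history):
        if s % 2 == state % 2:
            return a if r else 1 - a  # repeat what worked, flip what failed
    return 0  # no evidence yet for this parity: explore

print(evaluate(lambda s, h: 0))   # feedback-blind guesser: ~0.5
print(evaluate(learner))          # online rule discovery: approaches 1.0
```

The cost bullet compounds this. At the quoted $125 per million output tokens, a modest interactive sweep (say, 1,000 episodes at ~10k output tokens each, figures invented for illustration) burns roughly 10M tokens, about $1,250 per run, which is another plausible reason labs hesitate to publish exploratory ARC-AGI-3 numbers.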
// TAGS
anthropic · claude-mythos · llm · benchmark · reasoning · safety
DISCOVERED
2026-04-10
PUBLISHED
2026-04-10
RELEVANCE
8/10
AUTHOR
Neurogence