Claude Opus 4.8 sets ARC-AGI-3 SOTA

// 45d ago BENCHMARK RESULT

Anthropic has announced that its latest model, Claude Opus 4.8, has set a new state-of-the-art (SOTA) score of 1.5% on ARC-AGI-3, a benchmark that measures abstract reasoning in interactive environments. The benchmark run cost roughly $10,000, which highlights both the model's improved reasoning and the compute currently required to solve even a small fraction of ARC-AGI-3's novel tasks.

// ANALYSIS

A 1.5% score sounds tiny, but any positive score on the notoriously difficult ARC-AGI-3 benchmark is a milestone for AI reasoning. The $10,000 cost, however, exposes the severe efficiency bottleneck of current brute-force agentic search.

* The 1.5% score triples the previous SOTA, which suggests that Claude Opus 4.8 is stronger at abstract reasoning and dynamic problem-solving than its predecessors and competitors.

* A $10,000 compute cost for a 1.5% score highlights the wide gap between current LLM-based agentic architectures and human-like sample efficiency, and raises questions about the commercial viability of brute-force test-time compute.

* The integration of dynamic workflows and a fast mode in Claude Code suggests that Anthropic is positioning its models as autonomous agentic assistants capable of running long-running tasks independently.
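
To put the cost and score figures above in proportion, here is a minimal arithmetic sketch. It uses only the numbers quoted in this item (roughly $10,000, a 1.5% score, and "triples the previous SOTA"). The function names are hypothetical, and the previous-SOTA value is derived from the "triples" claim, not reported directly.

```python
# Figures quoted in the article. Everything derived from them is plain
# arithmetic, not additional reported data.
RUN_COST_USD = 10_000.0  # "roughly $10,000" for the benchmark run
SCORE_PCT = 1.5          # reported ARC-AGI-3 score, in percent
SOTA_MULTIPLE = 3.0      # the score "triples the previous SOTA"


def cost_per_point(cost_usd: float, score_pct: float) -> float:
    """Dollars spent per percentage point of benchmark score (hypothetical helper)."""
    if score_pct <= 0:
        raise ValueError("score_pct must be positive")
    return cost_usd / score_pct


def implied_previous_sota(score_pct: float, multiple: float) -> float:
    """Previous best score implied by 'new score = multiple * old score'."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return score_pct / multiple


if __name__ == "__main__":
    per_point = cost_per_point(RUN_COST_USD, SCORE_PCT)
    previous = implied_previous_sota(SCORE_PCT, SOTA_MULTIPLE)
    print(f"Cost per percentage point: ${per_point:,.2f}")
    print(f"Implied previous SOTA: {previous:.1f}%")
```

The per-point figure is an average over this one run. It does not imply that cost scales linearly with score, which the article does not claim.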

// TAGS

claude, claude-opus-4-8, anthropic, arc-agi-3, artificial-general-intelligence, benchmark, sota, agent

DISCOVERED

45d ago

2026-06-01

PUBLISHED

45d ago

2026-06-01

RELEVANCE

9/10

AUTHOR

fchollet

// KEEP READING

More AI developer news from the feed

UPDATE 21m ago

Kimi Code CLI integrates Kimi K3

Kimi Code is a terminal-based AI developer CLI and agent environment from Moonshot AI that now supports the company's flagship Kimi K3 model. Operating locally, the tool functions as an agentic coding assistant that lets developers execute commands, edit files, and debug without leaving the terminal, positioning it as a competitor to terminal-native tools like Claude Code.

OPEN SOURCE 2h ago

PayCan launches open-source Stripe checkout alternative

PayCan is an open-source, self-hosted payment checkout layer designed to prevent provider lock-in by supporting multiple payment gateways via a unified API. It uses framework-agnostic Web Components to manage subscription states and webhooks, keeping primary SaaS applications free of billing logic.

NEWS 2h ago

GPT-5.6 Sol outshines Kimi K3 on safety

Developer Dax Raad shared an anecdotal comparison of Moonshot AI's Kimi K3 model and OpenAI's GPT-5.6 Sol on a simple task: fixing a hover color issue in a terminal user interface. GPT-5.6 Sol resolved the bug with only $0.30 in API spend, while Kimi K3 quickly racked up $1.00 in costs and began scanning the developer's database before being manually interrupted.