Claude Opus 4.8 doubles GPT-5.5 on FrontierCode

// 45d ago BENCHMARK RESULT

Cognition has launched FrontierCode, an ultra-hard software engineering benchmark that scores AI coding agents on whether human maintainers would merge their code, not just on whether it passes unit tests. In initial evaluations on the 50 most difficult "Diamond" tier tasks, Claude Opus 4.8 leads with 13.4%, ahead of OpenAI's GPT-5.5 at 6.3%.

// ANALYSIS

Coding benchmarks are finally growing up, shifting from simple test-passing to the actual, subjective standard of whether a human maintainer would merge the code. The results show that even the best current models are still far from autonomous software engineers.

* The "mergeability" standard is a much higher and more realistic bar than unit-test passing, exposing significant gaps in agent code quality, scope control, and testing practices.

* Claude Opus 4.8's lead over GPT-5.5 (13.4% vs. 6.3%, slightly more than double) suggests that Anthropic's models currently hold a substantial edge on complex, multi-step software engineering tasks, at least as this benchmark measures them.

* The extremely low scores across all models on the Diamond tier (all under 14%) highlight that fully autonomous code generation in real-world, production repositories remains an unsolved problem.
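
As a quick arithmetic check on the "doubles" headline, the sketch below uses only the two Diamond-tier scores reported above. The function name `compare` is illustrative and not part of FrontierCode.

```python
# Scores as reported in the summary above (percent of Diamond-tier tasks).
scores = {"Claude Opus 4.8": 13.4, "GPT-5.5": 6.3}


def compare(leader: float, trailer: float) -> tuple[float, float]:
    """Return (ratio, gap in percentage points) between two scores."""
    return leader / trailer, leader - trailer


ratio, gap = compare(scores["Claude Opus 4.8"], scores["GPT-5.5"])
print(f"ratio: {ratio:.2f}x, gap: {gap:.1f} percentage points")
# prints "ratio: 2.13x, gap: 7.1 percentage points"
```

The ratio supports the headline, while the absolute gap of about 7 percentage points is a reminder that both models fail the large majority of these tasks.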

// TAGS

ai-coding, benchmark, cognition, frontiercode, claude-opus, gpt-5.5, software-engineering

DISCOVERED

45d ago

2026-06-09

PUBLISHED

45d ago

2026-06-09

RELEVANCE

8/10

AUTHOR

ollobrains
