REDDIT // 4d ago // BENCHMARK RESULT
Claude Mythos leaks, crushes Opus 4.6 benchmarks
Leaked internal benchmarks for Anthropic’s unreleased Claude Mythos model reportedly show a generational leap in autonomous software engineering and exploit development over the current Opus 4.6 flagship.
// ANALYSIS
Mythos marks a shift from LLMs that assist to models that act autonomously, most visibly in complex cybersecurity tasks that previously required human intervention.
- SWE-bench Verified scores in the mid-to-high 80s suggest Mythos can handle multi-file repo maintenance with minimal supervision.
- The jump in autonomous exploit development (90%+ success on JS shells) explains Anthropic’s cautious, gate-kept preview rollout.
- Codenamed "Capybara," the model introduces a new pricing and performance tier above the existing Opus line.
- Terminal-Bench 2.0 scores exceeding 75% point toward a future of fully autonomous DevOps and system administration agents.
// TAGS
claude-mythos · llm · ai-coding · agent · benchmark · safety · reasoning
DISCOVERED
4d ago
2026-04-07
PUBLISHED
4d ago
2026-04-07
RELEVANCE
10/10
AUTHOR
Independent-Wind4462