CyberGym benchmark reveals massive AI hacking leap

// BENCHMARK RESULT (45d ago)

New results from UC Berkeley’s CyberGym benchmark show frontier models like GPT-5.5 and Claude Mythos achieving over 80% success in autonomous vulnerability reproduction. The framework, which spans 1,507 real-world tasks, has already helped agents discover 35 new zero-day vulnerabilities.

// ANALYSIS

CyberGym is graduating from a research project to the definitive metric for agentic security capabilities, effectively becoming the "SWE-bench" of offensive research.

– Frontier models have jumped from ~12% to over 80% success in less than a year, signaling a breakthrough in long-horizon reasoning and codebase navigation.
– The discovery of 35 zero-days in major libraries like OpenSSL suggests that AI agents can now outperform traditional fuzzing and human review in specific contexts.
– High performance on CyberGym was reportedly a key factor in Anthropic's decision to gate "Mythos," illustrating how benchmarks are now driving safety policy.
– The shift to execution-based evaluation (requiring a working PoC) guards against the "data contamination" that plagues static security benchmarks.
– Microsoft’s MDASH now leads the leaderboard at 88.45%, demonstrating the efficacy of multi-agent architectures in complex security tasks.
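
To make the execution-based scoring idea concrete, here is a minimal sketch of how a "working PoC" check and a success rate could be computed. It is not CyberGym's actual harness. The pass criterion used here (the PoC crashes the pre-patch build but not the post-patch build), the rule that any nonzero exit code counts as a crash, and every name in the code (`PocRun`, `reproduces`, `success_rate`, the task IDs) are assumptions made for illustration, and the exit codes are made-up sample data rather than benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PocRun:
    """Exit codes from running one candidate PoC (hypothetical record type)."""
    task_id: str
    vulnerable_exit: int  # exit code observed on the pre-patch build
    patched_exit: int     # exit code observed on the post-patch build


def crashed(exit_code: int) -> bool:
    # Assumption: any nonzero exit code (sanitizer abort, fatal signal) is a crash.
    return exit_code != 0


def reproduces(run: PocRun) -> bool:
    # Assumed criterion: the PoC must crash the vulnerable build
    # and leave the patched build running cleanly.
    return crashed(run.vulnerable_exit) and not crashed(run.patched_exit)


def success_rate(runs: list[PocRun]) -> float:
    # Fraction of tasks whose PoC satisfies the criterion above.
    if not runs:
        return 0.0
    return sum(reproduces(r) for r in runs) / len(runs)


# Made-up sample outcomes, for illustration only.
runs = [
    PocRun("task-a", vulnerable_exit=1, patched_exit=0),      # counts as solved
    PocRun("task-b", vulnerable_exit=0, patched_exit=0),      # no crash at all
    PocRun("task-c", vulnerable_exit=-11, patched_exit=-11),  # crashes both builds
    PocRun("task-d", vulnerable_exit=-6, patched_exit=0),     # counts as solved
]

print(f"{success_rate(runs):.2%}")
```

Because the score depends on what the binary does when the input is run, a model that has merely memorized a vulnerability description gains nothing unless it can also produce an input that actually triggers the bug.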

// TAGS

cybergym, benchmark, security, agent, llm, evaluation, uc-berkeley, ai-coding

DISCOVERED: 2026-05-15 (45d ago)

PUBLISHED: 2026-05-15 (45d ago)

RELEVANCE: 9/10

AUTHOR: Wes Roth
