METR data revives AI R&D odds
Ajeya Cotra says recent METR benchmark results and Claude Opus 4.6’s jump to roughly a 12-hour 50% software-task horizon made her January 2026 forecast look too conservative. She now argues a 10% chance of full AI R&D automation this year is no longer easy to dismiss, especially if long projects decompose into agent-manageable chunks.
This is one of the sharper near-term AI progress arguments because it turns benchmark movement into a labor-market claim, not just another “models got better” post. The big question is no longer whether agents can code, but whether orchestration and decomposition let them scale from roughly twelve-hour tasks to the throughput of a real research org faster than expected.
- METR’s time-horizon framing is becoming a high-signal way to translate eval scores into what frontier agents can actually do for engineering teams
- Cotra’s update is driven less by hype than by the pace of recent benchmark movement, especially the apparent jump from roughly five-hour to roughly twelve-hour tasks in a couple of months
- The most important claim is not that one agent can do year-long work alone, but that cheap agent teams plus heavy scaffolding could chew through decomposable projects surprisingly well
- This still falls short of proving end-to-end AI R&D automation, since Cotra explicitly flags research judgment, creativity, and high-reliability execution as unresolved bottlenecks
- If current suites are saturating, labs will need messier 80% and 95% reliability evals fast, or they will lose visibility into whether agents are nearing real deployment thresholds
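To make the “50% time horizon” framing concrete: METR-style horizons come from fitting a success-probability curve over task duration and reading off where it crosses 50%. The sketch below is illustrative only, with made-up `results` data and a hand-rolled logistic fit (`fit_logistic` is not METR code); it shows the shape of the calculation, not their actual pipeline.

```python
import math

# Hypothetical (task_duration_minutes, succeeded) outcomes for one model.
# Real METR suites are far larger; these numbers are invented for illustration.
results = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (60, 1), (120, 1), (240, 0), (480, 1), (960, 0), (1920, 0),
]

def fit_logistic(results, steps=20000, lr=0.05):
    """Fit p(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the log-likelihood (concave, so this converges for a sane lr)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, succeeded in results:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += succeeded - p
            grad_b += (succeeded - p) * x
        a += lr * grad_a / len(results)
        b += lr * grad_b / len(results)
    return a, b

a, b = fit_logistic(results)
# The 50% horizon is the duration where a + b * log2(minutes) = 0,
# i.e. where predicted success probability crosses one half.
horizon_minutes = 2 ** (-a / b)
print(f"50% time horizon ≈ {horizon_minutes / 60:.1f} hours")
```

The useful property of this framing is that a single number (the crossover duration) tracks capability growth across model generations, which is what makes jumps like “five hours to twelve hours in two months” legible as a trend rather than a grab bag of eval deltas.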
DISCOVERED 2026-03-06
PUBLISHED 2026-03-05
AUTHOR SteppenAxolotl