METR Time Horizon 1.1 shows faster gains
METR’s updated time-horizon benchmark expands its autonomous-task suite from 170 to 228 tasks, doubles the number of 8-hour-plus tasks, and finds frontier agent progress since 2023 is speeding up rather than flattening. Under TH1.1, the post-2023 doubling time drops to about 131 days versus 165 days in the prior setup, pushing the conversation closer to multi-hour software work.
The notable part here is not just that agents improved, but that METR made the benchmark tougher and the curve still got steeper. This is one of the clearest attempts to translate “AI is getting better” into an operational estimate of how much real low-context software work frontier agents can reliably finish.
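The doubling-time figures imply concretely different growth curves. A minimal sketch of the implied exponential projection, assuming a hypothetical 1-hour baseline horizon today (the baseline value is illustrative, not from METR):

```python
def horizon_hours(baseline_hours: float, days_elapsed: float, doubling_days: float) -> float:
    """Project a time horizon forward under simple exponential growth:
    the horizon doubles every `doubling_days` days."""
    return baseline_hours * 2 ** (days_elapsed / doubling_days)

# Hypothetical 1-hour baseline, projected one year out under each estimate.
baseline = 1.0
th1 = horizon_hours(baseline, 365, 165)    # TH1's post-2023 doubling time
th11 = horizon_hours(baseline, 365, 131)   # TH1.1's faster estimate

print(f"TH1  (165-day doubling): {th1:.1f} h")
print(f"TH1.1 (131-day doubling): {th11:.1f} h")
print(f"doubling time shortened by {1 - 131 / 165:.0%}")
```

Under these numbers, a year of progress yields roughly a 4.6x horizon under TH1 but about 6.9x under TH1.1, which is what the "about 20% faster" doubling time cashes out to over time.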
- TH1.1 adds 73 tasks, removes 15 weak ones, updates 53, and raises the count of 8h+ tasks from 14 to 31, which makes the benchmark materially more informative at the high end.
- METR says the new post-2023 doubling time is about 20% faster than in TH1, suggesting recent frontier gains are compressing the timeline toward multi-hour autonomous work.
- The shift is partly about task composition: older GPT-4 variants moved down while newer models like GPT-5 moved up, so the benchmark is refining what “time horizon” actually measures rather than just rubber-stamping hype.
- The move from METR’s in-house Vivaria stack to the open-source Inspect framework makes the benchmark easier to scrutinize and harder to dismiss as a one-off internal eval.
- METR is explicit that this is not “AI can now do whole jobs”; the tasks are mostly well-specified software, ML, and cyber work done with low prior context, which is exactly why the benchmark is useful and easy to overread.
DISCOVERED: 2026-03-06
PUBLISHED: 2026-03-06
AUTHOR: Wes Roth