METR Time Horizon 1.1 shows faster gains
METR’s updated time-horizon benchmark expands its autonomous-task suite from 170 to 228 tasks, doubles the number of 8-hour-plus tasks, and finds frontier agent progress since 2023 is speeding up rather than flattening. Under TH1.1, the post-2023 doubling time drops to about 131 days versus 165 days in the prior setup, pushing the conversation closer to multi-hour software work.
The notable part here is not just that agents improved, but that METR made the benchmark tougher and the curve still got steeper. This is one of the clearest attempts to translate “AI is getting better” into an operational estimate of how much real low-context software work frontier agents can reliably finish.
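The doubling-time figures imply concretely different growth curves. A minimal sketch of the implied exponential projection, assuming a hypothetical 1-hour baseline horizon today (the baseline value is illustrative, not from METR):

```python
def horizon_hours(baseline_hours: float, days_elapsed: float, doubling_days: float) -> float:
    """Project a time horizon forward under simple exponential growth:
    the horizon doubles every `doubling_days` days."""
    return baseline_hours * 2 ** (days_elapsed / doubling_days)

# Hypothetical 1-hour baseline, projected one year out under each estimate.
baseline = 1.0
th1 = horizon_hours(baseline, 365, 165)    # TH1's post-2023 doubling time
th11 = horizon_hours(baseline, 365, 131)   # TH1.1's faster estimate

print(f"TH1  (165-day doubling): {th1:.1f} h")
print(f"TH1.1 (131-day doubling): {th11:.1f} h")
print(f"doubling time shortened by {1 - 131 / 165:.0%}")
```

Under these numbers, a year of progress yields roughly a 4.6x horizon under TH1 but about 6.9x under TH1.1, which is what the "about 20% faster" doubling time cashes out to over time.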
- TH1.1 adds 73 tasks, removes 15 weak ones, updates 53, and raises the count of 8h+ tasks from 14 to 31, which makes the benchmark materially more informative at the high end.
- METR says the new post-2023 doubling time is about 20% faster than in TH1, suggesting recent frontier gains are compressing the timeline toward multi-hour autonomous work.
- The shift is partly about task composition: older GPT-4 variants moved down while newer models like GPT-5 moved up, so the benchmark is refining what “time horizon” actually measures rather than just rubber-stamping hype.
- The move from METR’s in-house Vivaria stack to the open-source Inspect framework makes the benchmark easier to scrutinize and harder to dismiss as a one-off internal eval.
- METR is explicit that this is not “AI can now do whole jobs”; the tasks are mostly well-specified software, ML, and cyber work done with low prior context, which is exactly why the benchmark is useful and easy to overread.
DISCOVERED: 2026-03-06
PUBLISHED: 2026-03-06
AUTHOR: Wes Roth