UC Berkeley debuts Agents' Last Exam

// 45d ago · BENCHMARK RESULT

UC Berkeley has introduced "Agents' Last Exam" (ALE), a comprehensive benchmark evaluating AI agents on long-horizon, economically valuable tasks across 13 industry clusters. Baseline testing on frontier AI agents reveals a massive capability gap, with models achieving a pass rate of just 2.6%.

// ANALYSIS

Current frontier AI agents are not yet ready for autonomous, real-world economic tasks, failing to maintain accuracy over long-horizon workflows.

* The 2.6% pass rate shows how far current agent capabilities fall short of real-world job requirements.

* Covering 13 industry clusters ensures the benchmark measures diverse, practical workflows rather than narrow, synthetic tasks.

* The benchmark establishes a rigorous, much-needed standard for measuring agentic progress as LLM developers pivot towards agentic systems.

// TAGS

agent · benchmark · uc-berkeley · artificial-intelligence · agent-workflows

DISCOVERED

45d ago

2026-06-07

PUBLISHED

45d ago

2026-06-07

RELEVANCE

8/10

AUTHOR

Discover AI
