Dawn Song launches Agents' Last Exam
UC Berkeley's Dawn Song has introduced the Agents' Last Exam (ALE), a benchmark featuring over 1,500 expert-tier professional tasks to evaluate AI agents. Initial findings show that even frontier models like Claude Fable 5 score 0% on the most complex multi-day assignments, highlighting a gap between model capabilities and real-world readiness.
AI agents aren't taking your job tomorrow. Claude Fable 5 is a major step forward in agentic capability, but the "job-ready" narrative remains mostly marketing hype until models can reliably execute multi-day workflows without human intervention.
* The Shift to Agentic Workflows: Claude Fable 5 represents a transition from simple chat models to long-horizon, multi-step systems capable of tool use and self-correction.
* The Reality of the "0% Success Rate": The UC Berkeley Agents' Last Exam (ALE) benchmark shows that while frontier models can handle short tasks, they score 0% on the most complex, multi-day, expert-tier assignments.
* Cost vs. Value: Because agentic sessions are token-intensive and expensive to run, organizations should judge ROI by a concrete metric, dollars per verified task completion, rather than by raw capability.
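
As a rough illustration of the "dollars per verified task completion" metric, the Python sketch below divides the total token spend across all agent sessions, including failed ones, by the number of sessions whose output passed an independent check. The `AgentSession` type, the function names, and the per-million-token prices are all hypothetical placeholders, not figures from ALE or from any vendor's price list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentSession:
    """One agent run on one task. Fields are illustrative assumptions."""
    input_tokens: int
    output_tokens: int
    verified: bool  # True only if the result passed an independent check


def session_cost(s: AgentSession, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token cost of one session in USD, given per-million-token prices."""
    return (s.input_tokens * usd_per_m_input
            + s.output_tokens * usd_per_m_output) / 1_000_000


def cost_per_verified_task(sessions: list[AgentSession],
                           usd_per_m_input: float,
                           usd_per_m_output: float) -> Optional[float]:
    """Total spend (failed runs included) divided by verified completions.

    Returns None when nothing was verified, since the metric is undefined.
    """
    total = sum(session_cost(s, usd_per_m_input, usd_per_m_output) for s in sessions)
    verified = sum(1 for s in sessions if s.verified)
    if verified == 0:
        return None
    return total / verified


if __name__ == "__main__":
    # Hypothetical runs and hypothetical prices (USD per million tokens).
    runs = [
        AgentSession(input_tokens=2_000_000, output_tokens=100_000, verified=True),
        AgentSession(input_tokens=3_000_000, output_tokens=200_000, verified=False),
        AgentSession(input_tokens=1_000_000, output_tokens=100_000, verified=True),
    ]
    print(cost_per_verified_task(runs, usd_per_m_input=1.0, usd_per_m_output=10.0))
```

Counting the failed session in the numerator is deliberate: a run that burns tokens without producing a verified result still costs money, so a low success rate pushes the metric up even when individual sessions look cheap. If no session is verified, the function returns `None` rather than a misleading number.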
DISCOVERED: 2026-06-12
PUBLISHED: 2026-06-12
AUTHOR: steipete