PlanBench-XL tests tool-use agents in massive ecosystems
PlanBench-XL is a dynamic benchmark that evaluates long-horizon planning for LLM agents across 1,665 tools and 327 tasks. Unlike previous benchmarks, it introduces retrieval-limited visibility and dynamic tool failures to simulate real-world unpredictability.
Most tool-use evaluations assume perfect environments, but PlanBench-XL exposes how fragile current LLM agents are when things break. The massive performance drop under constraints shows the industry is still far from robust autonomous agents.
- High-performing models like GPT-5.4 plummet from 52% to 11% accuracy when subjected to severe tool blocking and missing paths
- Agents specifically struggle when tool failures lack explicit error signals, failing to backtrack or find alternative solutions
- The benchmark enforces partial tool visibility, requiring agents to actively retrieve and explore rather than reasoning over a fully known context
- It serves as a vital diagnostic testbed for identifying planning failures in large-scale, imperfect environments
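To make the two constraints above concrete, here is a minimal Python sketch of a toy environment with retrieval-limited visibility and silent tool failures. It is not PlanBench-XL's actual harness or API: `ToolEnvironment`, `retrieve`, `call`, `run_with_fallback`, the keyword-overlap ranking, and the tool names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEnvironment:
    """Hypothetical toy environment, not the benchmark's real interface."""
    tools: dict[str, str]                            # tool name -> description
    blocked: set[str] = field(default_factory=set)   # tools that fail silently
    k: int = 2                                       # max tools visible per query

    def retrieve(self, query: str) -> list[str]:
        """Return at most k tool names, ranked by naive keyword overlap."""
        words = set(query.lower().split())
        scored = [
            (len(words & set(desc.lower().split())), name)
            for name, desc in self.tools.items()
        ]
        # Highest overlap first; ties broken alphabetically; drop zero-overlap tools.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda t: (-t[0], t[1]))
        return [name for _, name in ranked[: self.k]]

    def call(self, name: str, arg: str):
        """A blocked tool returns None: a failure with no explicit error signal."""
        if name in self.blocked:
            return None
        return f"{name}({arg})"


def run_with_fallback(env: ToolEnvironment, query: str, arg: str):
    """Try each retrieved tool in order, treating an empty result as a failure."""
    for name in env.retrieve(query):
        result = env.call(name, arg)
        if result is not None:
            return name, result
    return None  # every visible tool failed; the agent would need a new query


env = ToolEnvironment(
    tools={
        "search_flights": "search flights between cities",
        "book_flight": "book a flight ticket",
        "flight_lookup": "lookup flights by airline",
    },
    blocked={"search_flights"},
)
visible = env.retrieve("search flights")
outcome = run_with_fallback(env, "search flights", "NYC")
```

In this sketch the agent never sees `book_flight` for the query, and the top-ranked tool fails without raising an error, so progress depends on noticing the empty result and moving to the next candidate. That is the backtracking behavior the summary says current agents often lack.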
DISCOVERED
2026-06-25
PUBLISHED
2026-06-23
AUTHOR
_akhaliq