
RuneBench tests AI agent planning in RuneScape
RuneBench is an open-source evaluation benchmark designed to measure the planning capabilities and process reliability of AI coding agents. Using a TypeScript SDK, agents must navigate game systems, consult wiki documentation, and maximize their XP rate to achieve long-horizon goals.
Game environments provide a robust sandbox for agent evaluation, challenging models with long-horizon tasks and dynamic state changes that static code benchmarks cannot replicate.
- The use of RuneScape creates a highly structured yet complex environment with clear feedback loops and success metrics.
- Requiring agents to read and act on wiki documentation tests real-world documentation-reading and tool-use capabilities.
- Measuring performance through efficiency (e.g., XP rate) instead of binary success/failure forces agents to optimize strategies dynamically.
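To make the difference between binary and efficiency-based scoring concrete, here is a minimal TypeScript sketch. It is not taken from the RuneBench SDK: the `XpSample` type and the `reachedGoal` and `xpPerHour` functions are hypothetical names chosen for illustration, and the sample numbers are made up.

```typescript
// Hypothetical types and helpers for illustration only; not the RuneBench SDK API.
interface XpSample {
  elapsedMs: number; // wall-clock time since the run started
  totalXp: number; // cumulative XP in the tracked skill at that time
}

// Binary scoring: did the agent ever reach the target XP?
function reachedGoal(samples: XpSample[], targetXp: number): boolean {
  return samples.some((s) => s.totalXp >= targetXp);
}

// Efficiency scoring: XP gained per hour between the first and last sample.
function xpPerHour(samples: XpSample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const elapsedMs = last.elapsedMs - first.elapsedMs;
  if (elapsedMs <= 0) return 0;
  const hours = elapsedMs / 3_600_000;
  return (last.totalXp - first.totalXp) / hours;
}

// Two made-up runs that both reach 10,000 XP, one in an hour and one in half an hour.
const slowRun: XpSample[] = [
  { elapsedMs: 0, totalXp: 0 },
  { elapsedMs: 3_600_000, totalXp: 10_000 },
];
const fastRun: XpSample[] = [
  { elapsedMs: 0, totalXp: 0 },
  { elapsedMs: 1_800_000, totalXp: 10_000 },
];

console.log(reachedGoal(slowRun, 10_000), xpPerHour(slowRun));
console.log(reachedGoal(fastRun, 10_000), xpPerHour(fastRun));
```

Both runs pass the binary check, so a pass/fail metric cannot tell them apart. The rate-based score separates them, which is the property the third takeaway describes.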
DISCOVERED: 2026-07-01
PUBLISHED: 2026-07-01
AUTHOR: Wes Roth
