OSWorld 2.0 evaluates long-horizon AI agents
OSWorld 2.0 is a benchmark suite developed by XLANG Lab that contains 108 professional-grade workflows for evaluating computer-use AI agents on long-horizon tasks. Initial evaluations of frontier models such as Claude Opus 4.8 and GPT-5.5 reveal major performance bottlenecks: the top model completes only 20.6% of tasks.
Current computer-use AI agents are still far from ready for autonomous workspace execution, as long-horizon tasks expose their tendency to lose context and fail at self-verification.
* The shift from short tasks (under 30 steps) to multi-hour workflows (300+ steps) shows that state tracking and error recovery remain major bottlenecks for frontier LLMs.
* The low success rate of top-tier models (at most 20.6% binary completion) suggests that current agent architectures struggle with dynamic, streaming information that changes mid-task.
* Fine-grained final-state checking in OSWorld 2.0 is likely to push the research community beyond simple pass/fail evaluation and toward partial-progress scoring and state validation.
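The difference between binary completion and partial-progress scoring can be sketched as follows. This is a minimal illustration, not OSWorld 2.0's actual evaluator: the `Check` class, the `evaluate` function, the check names, and the example final state are all hypothetical, and the equal weighting of checks is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the environment's final state.
State = Dict[str, object]


@dataclass
class Check:
    """One fine-grained condition on the final state (illustrative only)."""
    name: str
    predicate: Callable[[State], bool]


def evaluate(state: State, checks: List[Check]) -> dict:
    """Run every check and report both a binary and a partial score."""
    if not checks:
        raise ValueError("at least one check is required")
    results = {c.name: bool(c.predicate(state)) for c in checks}
    passed = sum(results.values())
    return {
        # Binary completion: the task counts only if every check passes.
        "binary_success": passed == len(checks),
        # Partial progress: fraction of checks passed (equal weights assumed).
        "partial_score": passed / len(checks),
        "failed": [name for name, ok in results.items() if not ok],
    }


# Made-up checks for an invoice-reconciliation style workflow.
checks = [
    Check("report_saved", lambda s: s.get("report_path") == "/home/user/report.xlsx"),
    Check("email_sent", lambda s: s.get("emails_sent", 0) >= 1),
    Check("totals_match", lambda s: s.get("sheet_total") == s.get("invoice_total")),
]

# Made-up final state: the agent did most of the work but left a wrong total.
final_state = {
    "report_path": "/home/user/report.xlsx",
    "emails_sent": 1,
    "sheet_total": 1200,
    "invoice_total": 1250,
}

result = evaluate(final_state, checks)
print(result)
```

In this sketch the agent passes two of three checks, so it fails under binary completion while still earning partial credit, and the `failed` list shows which part of the final state was wrong.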
DISCOVERED: 2026-06-30
PUBLISHED: 2026-06-30
AUTHOR: _akhaliq