OSWorld 2.0 evaluates long-horizon AI agents

// 1h ago · BENCHMARK RESULT

OSWorld 2.0 is a benchmark suite developed by XLANG Lab containing 108 professional-grade workflows to evaluate computer-use AI agents on long-horizon tasks. Initial evaluations on frontier models like Claude Opus 4.8 and GPT-5.5 reveal major performance bottlenecks, with the top model completing only 20.6% of tasks successfully.

// ANALYSIS

Current computer-use AI agents are still far from ready for autonomous workspace execution, as long-horizon tasks expose their tendency to lose context and fail at self-verification.

* The transition from short-term tasks (under 30 steps) to multi-hour workflows (300+ steps) highlights that state tracking and error recovery remain major bottlenecks for frontier LLMs.

* The poor success rate of top-tier models (maxing out at 20.6% binary completion) suggests that current agent architectures struggle heavily with dynamic, streaming information that changes mid-task.

* OSWorld 2.0's fine-grained final-state checking should push the research community beyond simple pass/fail evaluation toward partial-progress scoring and state validation.
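
To make the contrast between binary completion and partial-progress scoring concrete, here is a minimal Python sketch. It is not OSWorld 2.0's actual evaluation harness: the `Check` type, the `score` function, and the example state keys are all hypothetical names chosen for illustration. The assumption is that a task's final state can be validated by a list of independent predicates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of a task's final state (not from OSWorld 2.0).
State = Dict[str, object]


@dataclass
class Check:
    """One fine-grained assertion about the final state (illustrative only)."""
    name: str
    predicate: Callable[[State], bool]


def score(final_state: State, checks: List[Check]) -> Dict[str, float]:
    """Return both a binary completion score and a partial-progress score."""
    passed = [c.name for c in checks if c.predicate(final_state)]
    partial = len(passed) / len(checks) if checks else 0.0
    # Binary completion only counts a task when every check passes.
    binary = 1.0 if checks and len(passed) == len(checks) else 0.0
    return {"binary": binary, "partial": partial}


# Made-up example: the agent saved and sent a report but left one cell wrong.
final_state: State = {"report_saved": True, "email_sent": True, "total_cell": 41}
checks = [
    Check("report_saved", lambda s: s.get("report_saved") is True),
    Check("email_sent", lambda s: s.get("email_sent") is True),
    Check("total_correct", lambda s: s.get("total_cell") == 42),
]

result = score(final_state, checks)
print(result)
```

Under these assumptions, a run that satisfies two of three checks scores zero on binary completion but retains credit under partial scoring, which is the distinction the last bullet draws.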

// TAGS

osworld-2.0, agent, benchmark, llm, computer-use, xlang-lab

DISCOVERED

1h ago

2026-06-30

PUBLISHED

2h ago

2026-06-30

RELEVANCE

8/10

AUTHOR

_akhaliq
