Notte benchmark exposes agent execution tax

// 45d ago · BENCHMARK RESULT

Notte and Fireworks AI ran 720 WebVoyager browser-agent tasks across four models and found that retries, not raw token price, decide what a task really costs. MiniMax M2.5 came out 2.3x cheaper per successful task than Gemini 2.5 Flash, while GLM-5 posted the best accuracy and Kimi K2.5 showed zero parse retries in the instrumented runs.

// ANALYSIS

The useful takeaway here is procurement, not bragging rights: once an agent loop starts retrying malformed outputs, “cheap” token pricing becomes a mirage.

–Gemini 2.5 Flash had the highest execution tax at 22.9%, which is exactly the kind of hidden waste that compounds in production browser workflows
–MiniMax M2.5 looks like the best default on this benchmark because it combines top-tier accuracy with the lowest cost per successful task
–GLM-5 is the safer pick when task correctness matters more than latency, especially on structured or multi-step sites
–Kimi K2.5’s zero-parse-retry result is a strong signal that structured-output reliability can matter more than raw model intelligence in agent loops
–The benchmark is text-only and tied to a specific WebVoyager setup, so the right conclusion is not “model X wins everything” but “measure cost per outcome, not cost per token”
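
A minimal sketch of the "cost per outcome" arithmetic behind these points. The summary does not spell out how execution tax is computed, so this assumes it is the share of total tokens spent on calls that had to be retried. The `RunStats` schema, the function names, and every number below are hypothetical illustrations, not figures from the Notte benchmark.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate counters for one model over a benchmark run (hypothetical schema)."""
    tasks: int              # tasks attempted
    successes: int          # tasks completed correctly
    useful_tokens: int      # tokens spent on calls whose output was accepted
    retry_tokens: int       # tokens spent on calls that had to be retried
    usd_per_million: float  # blended token price in USD per 1M tokens


def total_cost(s: RunStats) -> float:
    """Total spend in USD, counting retried calls as well as accepted ones."""
    return (s.useful_tokens + s.retry_tokens) * s.usd_per_million / 1_000_000


def execution_tax(s: RunStats) -> float:
    """Assumed definition: fraction of all tokens that went to retried calls."""
    total = s.useful_tokens + s.retry_tokens
    return s.retry_tokens / total if total else 0.0


def cost_per_success(s: RunStats) -> float:
    """Total spend divided by successful tasks; infinite if nothing succeeded."""
    if s.successes == 0:
        return float("inf")
    return total_cost(s) / s.successes


# Made-up numbers: a cheaper-per-token model that retries often,
# and a pricier-per-token model that never retries.
cheap_but_flaky = RunStats(tasks=100, successes=60, useful_tokens=8_000_000,
                           retry_tokens=2_000_000, usd_per_million=0.30)
pricier_but_clean = RunStats(tasks=100, successes=90, useful_tokens=8_000_000,
                             retry_tokens=0, usd_per_million=0.40)

for name, stats in [("cheap_but_flaky", cheap_but_flaky),
                    ("pricier_but_clean", pricier_but_clean)]:
    print(f"{name}: tax={execution_tax(stats):.1%}, "
          f"cost/success=${cost_per_success(stats):.4f}")
```

With these invented inputs, the model with the higher token price ends up cheaper per successful task, because the other model pays for retries and finishes fewer tasks. Real comparisons would need per-call token logs and the provider's actual input and output prices.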

// TAGS

notte · benchmark · web-agent · agent · inference · pricing · open-weights · evaluation

DISCOVERED

45d ago

2026-05-21

PUBLISHED

45d ago

2026-05-21

RELEVANCE

9/10

AUTHOR

ogandrea
