Notte benchmark exposes agent execution tax
Notte and Fireworks AI ran 720 WebVoyager browser-agent tasks across four models and found that retries, not raw token price, decide what a task really costs. MiniMax M2.5 came out 2.3x cheaper per successful task than Gemini 2.5 Flash, while GLM-5 posted the best accuracy and Kimi K2.5 showed zero parse retries in the instrumented runs.
The useful takeaway here is procurement, not bragging rights: once an agent loop starts retrying malformed outputs, “cheap” token pricing becomes a mirage.
- –Gemini 2.5 Flash had the highest execution tax at 22.9%, which is exactly the kind of hidden waste that compounds in production browser workflows
- –MiniMax M2.5 looks like the best default on this benchmark because it combines top-tier accuracy with the lowest cost per successful task
- –GLM-5 is the safer pick when task correctness matters more than latency, especially on structured or multi-step sites
- –Kimi K2.5’s zero parse retries is a strong signal that structured-output reliability can matter more than raw model intelligence in agent loops
- –The benchmark is text-only and tied to a specific WebVoyager setup, so the right conclusion is not “model X wins everything” but “measure cost per outcome, not cost per token”
DISCOVERED
1h ago
2026-05-21
PUBLISHED
2h ago
2026-05-21
RELEVANCE
AUTHOR
ogandrea