Notte benchmark exposes agent execution tax

// 1h ago · BENCHMARK RESULT

Notte and Fireworks AI ran 720 WebVoyager browser-agent tasks across four models and found that retries, not raw token price, decide what a task really costs. MiniMax M2.5 came out 2.3x cheaper per successful task than Gemini 2.5 Flash, while GLM-5 posted the best accuracy and Kimi K2.5 showed zero parse retries in the instrumented runs.
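To make "cost per successful task" concrete, here is a minimal sketch of the arithmetic. It assumes total spend is billed per million input and output tokens, retries included, and is divided by the number of tasks that succeeded. The `RunStats` fields, the prices, and the token counts are hypothetical illustrations, not figures from the Notte and Fireworks AI runs.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate usage for one model over a task suite (hypothetical fields)."""
    tasks_succeeded: int
    input_tokens: int             # total across all calls, retries included
    output_tokens: int            # total across all calls, retries included
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens


def total_cost(stats: RunStats) -> float:
    """Total spend in USD, counting every token that was billed."""
    return (
        stats.input_tokens * stats.input_price_per_mtok
        + stats.output_tokens * stats.output_price_per_mtok
    ) / 1_000_000


def cost_per_successful_task(stats: RunStats) -> float:
    """Spend divided by successes; infinite if nothing succeeded."""
    if stats.tasks_succeeded == 0:
        return float("inf")
    return total_cost(stats) / stats.tasks_succeeded


# Illustrative, made-up numbers (NOT results from the benchmark).
low_price_many_retries = RunStats(100, 60_000_000, 4_000_000, 0.10, 0.40)
high_price_few_retries = RunStats(150, 12_000_000, 1_000_000, 0.30, 1.20)

for name, stats in [
    ("low price, many retries", low_price_many_retries),
    ("high price, few retries", high_price_few_retries),
]:
    print(f"{name}: ${cost_per_successful_task(stats):.3f} per successful task")
```

With these made-up inputs, the model with the higher per-token price ends up cheaper per successful task, because it burns fewer tokens on retries and completes more tasks.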

// ANALYSIS

The useful takeaway here concerns procurement, not bragging rights: once an agent loop starts retrying malformed outputs, every retried call is billed too, so “cheap” token pricing becomes a mirage.

  • Gemini 2.5 Flash had the highest execution tax at 22.9%, which is exactly the kind of hidden waste that compounds in production browser workflows
  • MiniMax M2.5 looks like the best default on this benchmark because it combines top-tier accuracy with the lowest cost per successful task
  • GLM-5 is the safer pick when task correctness matters more than latency, especially on structured or multi-step sites
  • Kimi K2.5’s zero parse retries are a strong signal that structured-output reliability can matter more than raw model intelligence in agent loops
  • The benchmark is text-only and tied to a specific WebVoyager setup, so the right conclusion is not “model X wins everything” but “measure cost per outcome, not cost per token”
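
The summary above does not say how execution tax is computed. As a rough sketch, the code below assumes it is the share of total spend consumed by calls that had to be retried (for example, because the output failed to parse). That definition, the function name `execution_tax`, and the per-call log shape are all assumptions for illustration, not the benchmark's own method.

```python
from typing import Iterable, Tuple


def execution_tax(calls: Iterable[Tuple[float, bool]]) -> float:
    """Fraction of spend that went to wasted calls.

    Each item is (cost_usd, was_wasted), where was_wasted marks a call whose
    output was discarded and retried. This definition is an assumption.
    """
    total = 0.0
    wasted = 0.0
    for cost_usd, was_wasted in calls:
        total += cost_usd
        if was_wasted:
            wasted += cost_usd
    return wasted / total if total > 0 else 0.0


# Hypothetical call log for one task: the second call was a parse failure.
example_calls = [(0.010, False), (0.010, True), (0.020, False), (0.010, False)]
print(f"execution tax: {execution_tax(example_calls):.1%}")
```

Tracking this ratio per model alongside cost per successful task is one way to see whether a low token price is being eaten by retries.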
// TAGS
notte, benchmark, web-agent, agent, inference, pricing, open-weights, evaluation

DISCOVERED: 1h ago (2026-05-21)
PUBLISHED: 2h ago (2026-05-21)
RELEVANCE: 9/10
AUTHOR: ogandrea