VulcanBench developer hits cost estimation bug

// 1d ago INFRASTRUCTURE

VulcanBench developer Morgan Linton flagged a bug in the project's cost estimation command (`vulcanbench estimate`) that significantly overestimates run costs. The issue surfaced during a full 52-task suite run, where the estimate came out at $90 to $150 against an actual spend of $50 to $60.

// ANALYSIS

Pre-run cost estimation is crucial for teams running expensive LLM evaluation sweeps, but any scaling logic must account for costs already baked into local run histories. In VulcanBench's case, the estimator applies the judge multiplier to historical costs that already include judge spend, so the judge fee is counted a second time.

– The bug stems from `_task_judge_mult` unconditionally multiplying the indexed cost by 3.0 when judges are enabled, even though the stored `cost_usd` already includes the judge fee.
– Storing the base agent-only cost in the local runs index and scaling it dynamically would resolve the discrepancy.
– Reliable spend estimation is a key developer experience requirement for SWE benchmarking tools, as sweeps across massive reasoning models can quickly drain API budgets.
– The bug highlights the difficulty of maintaining accurate telemetry across compound AI workflows where execution costs are split between the main agent and verification judges.

// TAGS

vulcanbench, llm, benchmark, evaluation, devtool, open-source, ai-coding, code-generation

DISCOVERED

1d ago

2026-06-24

PUBLISHED

1d ago

2026-06-24

RELEVANCE

8/10

AUTHOR

morganlinton
