VulcanBench developer hits cost estimation bug
VulcanBench developer Morgan Linton flagged a bug in the project's cost estimation command (`vulcanbench estimate`) that leads to significant cost overestimation. The issue was discovered during a full 52-task suite run, where calculated estimates ranged from $90 to $150 instead of the actual $50 to $60.
Pre-run cost estimation is crucial for teams running expensive LLM evaluation sweeps, but the scaling logic must account for costs already recorded in local run histories. In VulcanBench's case, the estimator applies the judge multiplier to costs that already include the judge fee, because the historical logs store the full cost of each run.
- The bug stems from `_task_judge_mult` unconditionally multiplying the indexed cost by 3.0 when judges are enabled, even though the stored `cost_usd` already includes the judge fee.
- Storing the base agent-only cost in the local runs index and scaling it dynamically would resolve the discrepancy.
- Reliable spend estimation is a key developer experience requirement for SWE benchmarking tools, as sweeps across massive reasoning models can quickly drain API budgets.
- This highlights the difficulty of maintaining accurate telemetry across compound AI workflows where execution costs are split between the main agent and verification judges.
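The double-scaling described in the points above can be illustrated with a minimal sketch. It is not VulcanBench's actual code: the `IndexedRun` class, the `base_cost_usd` field, and both `estimate_*` functions are hypothetical, and the sketch assumes that a judged run costs exactly the agent-only cost times the 3.0 factor cited in the report.

```python
from dataclasses import dataclass

JUDGE_MULT = 3.0  # factor cited in the report; assumed to mean judged cost = base * 3.0


@dataclass
class IndexedRun:
    """Hypothetical entry in the local runs index."""
    task_id: str
    base_cost_usd: float  # hypothetical field: agent-only cost, no judge fee

    @property
    def cost_usd(self) -> float:
        # What the index reportedly stores today: the full cost, judge fee included.
        return self.base_cost_usd * JUDGE_MULT


def estimate_buggy(runs, judges_enabled):
    """Mirrors the reported behaviour: scale a cost that already includes the judge fee."""
    mult = JUDGE_MULT if judges_enabled else 1.0
    return sum(run.cost_usd * mult for run in runs)


def estimate_fixed(runs, judges_enabled):
    """Proposed direction: scale the stored agent-only cost at estimate time."""
    mult = JUDGE_MULT if judges_enabled else 1.0
    return sum(run.base_cost_usd * mult for run in runs)
```

With two illustrative runs whose agent-only costs are $0.50 and $1.00, `estimate_fixed` returns the full judged cost of the pair, while `estimate_buggy` returns three times that amount. The real overestimate in the report is smaller than 3x, so the actual relationship between agent and judge costs is likely less uniform than this sketch assumes.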
DISCOVERED
2026-06-24
PUBLISHED
2026-06-24
AUTHOR
morganlinton