Meta-Harness beats Claude Code on TerminalBench-2
REDDIT // 12d ago // RESEARCH PAPER

Stanford's Meta-Harness is an outer-loop system that uses a coding agent and full execution traces to iteratively rewrite LLM harness code. On TerminalBench-2, it outperforms Claude Code and other baselines while also posting gains on text classification and math reasoning.
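
To make the outer-loop idea concrete, here is a minimal sketch of what such a search could look like. It is not the paper's implementation: `outer_loop`, `toy_propose`, `toy_evaluate`, and the on-disk layout (`step_000/harness.py`, `score.json`, `trace.log`) are all hypothetical names chosen for illustration. The only idea taken from the summary is that each candidate's code, score, and unsummarized trace stay on disk for the proposer to read.

```python
import json
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical interfaces: the proposer reads the run directory and returns new
# harness source; the evaluator returns a score plus the raw execution trace.
Proposer = Callable[[Path], str]
Evaluator = Callable[[str], Tuple[float, str]]


def outer_loop(run_dir: Path, seed_harness: str, propose: Proposer,
               evaluate: Evaluator, steps: int) -> Tuple[str, float]:
    """Iteratively rewrite a harness, storing every candidate's code,
    score, and full trace so later proposals can inspect the history."""
    best_code, best_score = seed_harness, float("-inf")
    candidate = seed_harness
    for step in range(steps):
        if step > 0:
            candidate = propose(run_dir)  # proposer sees all prior steps
        score, trace = evaluate(candidate)
        step_dir = run_dir / f"step_{step:03d}"
        step_dir.mkdir(parents=True, exist_ok=True)
        (step_dir / "harness.py").write_text(candidate)
        (step_dir / "score.json").write_text(json.dumps({"score": score}))
        (step_dir / "trace.log").write_text(trace)  # stored raw, not summarized
        if score > best_score:
            best_code, best_score = candidate, score
    return best_code, best_score


def toy_evaluate(code: str) -> Tuple[float, str]:
    # Stand-in for a benchmark run: the score is the number of "retry" lines.
    n = code.count("retry")
    return float(n), f"ran harness with {n} retry rule(s)"


def toy_propose(run_dir: Path) -> str:
    # Stand-in for a coding agent: extend the best-scoring stored harness.
    stored = sorted(run_dir.glob("step_*"))
    best = max(stored, key=lambda d: json.loads((d / "score.json").read_text())["score"])
    return (best / "harness.py").read_text() + "retry\n"
```

In a real setup the proposer would be a coding agent and the evaluator a benchmark run; the toy stand-ins only exist so the loop can be executed end to end, for example with `outer_loop(Path(tmp), "# seed\n", toy_propose, toy_evaluate, steps=3)` on a temporary directory.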

// ANALYSIS

This is a strong reminder that for agentic systems, the harness can matter almost as much as the base model, and that raw execution traces beat compressed summaries when failures are subtle.

  • The paper reports giving the coding agent filesystem access to prior code, scores, and logs, with as much as 10M tokens of diagnostic context per step.
  • On TerminalBench-2, Meta-Harness reports 76.4% with Claude Opus 4.6 versus 58.0% for Claude Code, and 37.6% with Claude Haiku 4.5 versus 27.5% for Claude Code.
  • The same framework also improves text classification by 7.7 points over ACE while using 4x fewer context tokens, and lifts math reasoning by 4.7 points on average across five held-out models.
  • The interesting twist is that Claude Code is both part of the optimization loop and a benchmark entry, so the result is more about search and instrumentation than a simple model-vs-model bake-off.
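
For a sense of scale, the TerminalBench-2 figures quoted above can be turned into absolute and relative gains. The short calculation below uses only the four reported scores; the helper name `gains` and the dictionary layout are illustrative choices, not anything from the paper.

```python
from typing import Tuple

# Scores as quoted in the bullets above (percent on TerminalBench-2).
reported = {
    "Claude Opus 4.6": {"meta_harness": 76.4, "claude_code": 58.0},
    "Claude Haiku 4.5": {"meta_harness": 37.6, "claude_code": 27.5},
}


def gains(meta: float, base: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(meta - base, 1)
    relative = round(100 * (meta - base) / base, 1)
    return absolute, relative


for model, s in reported.items():
    a, r = gains(s["meta_harness"], s["claude_code"])
    print(f"{model}: +{a} points ({r}% relative)")
# Claude Opus 4.6: +18.4 points (31.7% relative)
# Claude Haiku 4.5: +10.1 points (36.7% relative)
```

The absolute gap is larger with the stronger model, while the relative improvement is slightly larger with the smaller one.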
// TAGS
meta-harness · claude-code · research · benchmark · agent · ai-coding

DISCOVERED

12d ago

2026-03-30

PUBLISHED

12d ago

2026-03-30

RELEVANCE

9/10

AUTHOR

Tolopono