Meta-Harness beats Claude Code on TerminalBench-2

// 104d ago · RESEARCH PAPER

Stanford's Meta-Harness is an outer-loop system that uses a coding agent and full execution traces to iteratively rewrite LLM harness code. On TerminalBench-2, it outperforms Claude Code and other baselines, and it also posts gains on text classification and math reasoning.

// ANALYSIS

This is a strong reminder that for agentic systems, the harness can matter almost as much as the base model, and raw traces beat summary compression when failures are subtle.

– The paper reports that the optimizer has filesystem access to prior code, scores, and logs, with as much as 10M tokens of diagnostic context per step.
– On TerminalBench-2, Meta-Harness reports 76.4% on Claude Opus 4.6 versus 58.0% for Claude Code, and 37.6% on Claude Haiku 4.5 versus 27.5% for Claude Code.
– The same framework also improves text classification by 7.7 points over ACE while using 4x fewer context tokens, and lifts math reasoning by 4.7 points on average across five held-out models.
– The interesting twist is that Claude Code is both part of the optimization loop and a benchmark entry, so the result says more about search and instrumentation than about a simple model-vs-model bake-off.
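
To make the outer-loop idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes a harness can be reduced to a small settings dict, and every name in it (`search`, `record`, `toy_evaluate`, `toy_propose`, the `retries` setting, the `step_NNN` directory layout) is hypothetical. The point it illustrates is only the shape of the loop: each candidate's code, score, and unsummarized trace are written to disk, and the proposer reads that history before suggesting the next rewrite.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical simplification: a "harness" is just a dict of settings.
Harness = dict
Evaluate = Callable[[Harness], tuple[float, str]]  # returns (score, raw trace)
Propose = Callable[[Path], Harness]                # reads the history directory


def record(history: Path, step: int, harness: Harness, score: float, trace: str) -> None:
    """Store one candidate's code, score, and unsummarized trace on disk."""
    run = history / f"step_{step:03d}"
    run.mkdir(parents=True)
    (run / "harness.json").write_text(json.dumps(harness))
    (run / "score.txt").write_text(str(score))
    (run / "trace.log").write_text(trace)


def search(seed: Harness, evaluate: Evaluate, propose: Propose,
           history: Path, steps: int) -> tuple[Harness, float]:
    """Outer loop: evaluate, log everything, let the proposer read the logs."""
    score, trace = evaluate(seed)
    record(history, 0, seed, score, trace)
    best, best_score = seed, score
    for step in range(1, steps + 1):
        candidate = propose(history)  # proposer sees all prior runs
        score, trace = evaluate(candidate)
        record(history, step, candidate, score, trace)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_evaluate(harness: Harness) -> tuple[float, str]:
    """Stand-in benchmark: more retries means fewer simulated timeouts."""
    retries = harness["retries"]
    failures = max(0, 3 - retries)
    trace = f"retries={retries}; tasks failed on timeout: {failures}"
    return 1.0 - failures / 3, trace


def toy_propose(history: Path) -> Harness:
    """Stand-in for the coding agent: read the latest trace, adjust one setting."""
    latest = sorted(history.iterdir())[-1]
    harness = json.loads((latest / "harness.json").read_text())
    trace = (latest / "trace.log").read_text()
    if not trace.endswith(": 0"):  # timeouts remain, so allow one more retry
        harness["retries"] += 1
    return harness


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best, best_score = search({"retries": 0}, toy_evaluate, toy_propose,
                                  Path(tmp), steps=4)
        print(best, best_score)
```

In the system the paper describes, the proposer role is played by a coding agent and the evaluation is a real benchmark run; the toy functions above only stand in for those so the loop can execute end to end.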

// TAGS

meta-harness · claude-code · research · benchmark · agent · ai-coding

DISCOVERED

104d ago

2026-03-30

PUBLISHED

104d ago

2026-03-30

RELEVANCE

9/10

AUTHOR

Tolopono
