OPEN_SOURCE
YT · YOUTUBE // 36d ago · RESEARCH PAPER
Evaluation finds AGENTS.md files cut coding-agent win rates
This paper benchmarks repository-level instruction files like AGENTS.md and CLAUDE.md across multiple coding agents, finding they usually raise inference cost by more than 20% while slightly hurting task success. The practical takeaway for AI-heavy dev teams is blunt: keep repo guidance minimal, specific, and focused on non-obvious constraints instead of restating what the codebase already says.
// ANALYSIS
This is a useful reality check for the cargo cult around giant repo instruction files. The paper does not say context files are useless; it says autogenerated markdown summaries often add noise, while concise human-written constraints can still help.
- The authors introduce AGENTbench, a new benchmark built from 138 real GitHub issues across 12 repositories that already contain developer-written context files
- LLM-generated context files lowered success rates on average and increased steps, testing, and file exploration, which translated into materially higher token spend
- Human-written context files performed better than autogenerated ones, but the gains were modest and inconsistent across models, so quality matters more than file existence
- The strongest evidence is behavioral: agents really do follow these files, which means bad or redundant instructions can actively drag them into extra work
- Hacker News discussion around the paper converged on the same practical lesson: use AGENTS.md for tribal knowledge, workflow constraints, and non-obvious gotchas, not repo summaries
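The lesson in the bullets above can be sketched as a minimal AGENTS.md. The file contents below are hypothetical illustrations of non-obvious, repo-specific constraints; none of the commands or paths come from the paper or any real repository:

```markdown
# AGENTS.md — short, specific, non-obvious only

<!-- Good: tribal knowledge the agent cannot infer by reading the code -->
- Run `make test-fast` before committing; the full suite needs network access
  and hangs in sandboxed environments.
- Never hand-edit files under `generated/`; rerun `make codegen` instead.
- The `legacy/` package intentionally fails the linter; do not "fix" it.

<!-- Bad: repo summaries the agent can already derive from the codebase -->
<!-- "This project is a web server built with the standard library…" -->
```

Per the findings summarized above, every line that restates what the codebase already says adds tokens and exploration steps without improving task success.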
// TAGS
evaluating-agents-md · ai-coding · agent · research · benchmark · devtool
DISCOVERED
2026-03-06 (36d ago)
PUBLISHED
2026-03-06 (36d ago)
RELEVANCE
9/10
AUTHOR
Theo - t3.gg