OPEN_SOURCE
YT · YOUTUBE // 36d ago · RESEARCH PAPER
Evaluation finds AGENTS.md files cut coding-agent win rates
This paper benchmarks repository-level instruction files like AGENTS.md and CLAUDE.md across multiple coding agents, finding they usually raise inference cost by more than 20% while slightly hurting task success. The practical takeaway for AI-heavy dev teams is blunt: keep repo guidance minimal, specific, and focused on non-obvious constraints instead of restating what the codebase already says.
// ANALYSIS
This is a useful reality check for the cargo cult around giant repo instruction files. The paper does not say context files are useless; it says autogenerated markdown summaries often add noise, while concise human-written constraints can still help.
- The authors introduce AGENTbench, a new benchmark built from 138 real GitHub issues across 12 repositories that already contain developer-written context files
- LLM-generated context files lowered success rates on average and increased steps, testing, and file exploration, which translated into materially higher token spend
- Human-written context files performed better than autogenerated ones, but the gains were modest and inconsistent across models, so quality matters more than file existence
- The strongest evidence is behavioral: agents really do follow these files, which means bad or redundant instructions can actively drag them into extra work
- Hacker News discussion around the paper converged on the same practical lesson: use AGENTS.md for tribal knowledge, workflow constraints, and non-obvious gotchas, not repo summaries
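The lesson in the bullets above can be sketched as a minimal AGENTS.md. The file contents below are hypothetical illustrations of non-obvious, repo-specific constraints; none of the commands or paths come from the paper or any real repository:

```markdown
# AGENTS.md — short, specific, non-obvious only

<!-- Good: tribal knowledge the agent cannot infer by reading the code -->
- Run `make test-fast` before committing; the full suite needs network access
  and hangs in sandboxed environments.
- Never hand-edit files under `generated/`; rerun `make codegen` instead.
- The `legacy/` package intentionally fails the linter; do not "fix" it.

<!-- Bad: repo summaries the agent can already derive from the codebase -->
<!-- "This project is a web server built with the standard library…" -->
```

Per the findings summarized above, every line that restates what the codebase already says adds tokens and exploration steps without improving task success.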
// TAGS
evaluating-agents-md · ai-coding · agent · research · benchmark · devtool
DISCOVERED
2026-03-06 (36d ago)
PUBLISHED
2026-03-06 (36d ago)
RELEVANCE
9/10
AUTHOR
Theo - t3.gg