Megaplan harness edges Opus on SWE-bench
OPEN_SOURCE ↗
REDDIT · 8d ago · BENCHMARK RESULT

Megaplan is a general-purpose planning and execution harness for LLMs, and its live SWE-bench dashboard currently shows open-weight models running through the harness ahead of Claude Opus 4.5 on the benchmark. At the time I checked, the experiment had scored 26 of 500 tasks, with 21 passes for an 80.8% pass rate.
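With only 26 of 500 tasks scored, the headline number carries a wide error bar. A quick back-of-the-envelope check (a Wilson score interval, not anything from the Megaplan repo) shows how much the 80.8% figure could still move:

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 21 passes out of 26 scored tasks, as reported on the dashboard
lo, hi = wilson_interval(21, 26)
print(f"pass rate: {21/26:.1%}, 95% CI: [{lo:.1%}, {hi:.1%}]")
```

The interval spans roughly twenty percentage points either way, which is why the lead over Opus should be read as provisional until more of the remaining 474 tasks come in.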

// ANALYSIS

This is a harness story more than a model story: the claim is that structured planning, critique, gating, and review can unlock much better coding performance from open models than one-shot execution.

  • The live setup uses GLM-5.1 for prep, plan, execute, and review, with MiniMax-M2.7-highspeed handling critique and review, which is a concrete example of phase-specialized orchestration
  • The repo frames Megaplan as a reusable workflow layer, not a one-off benchmark script, which makes the result more interesting for agent builders than for raw model rankings
  • The result is still early and noisy: 26 scored tasks is a small slice of SWE-bench Verified, so the lead could move as the remaining 474 tasks come in
  • The fact that all code and data are public makes this unusually replicable for a leaderboard claim, which should help separate signal from hype
  • If the curve holds, this strengthens the case that better agent scaffolding can matter as much as marginal model gains on software engineering tasks
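The phase-specialized orchestration described above can be sketched as a simple routing table. This is an illustrative sketch only: Megaplan's actual API is not shown in the post, and the prompts and the `call_model` backend here are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    model: str                      # which model handles this phase
    prompt: Callable[[str], str]    # hypothetical prompt template

# Route planning/execution to one model and critique/review to another,
# mirroring the GLM-5.1 / MiniMax-M2.7-highspeed split described above.
PHASES = [
    Phase("prep",     "glm-5.1",                lambda t: f"Summarize repo context for: {t}"),
    Phase("plan",     "glm-5.1",                lambda t: f"Write a step-by-step fix plan for: {t}"),
    Phase("critique", "minimax-m2.7-highspeed", lambda t: f"Critique the plan for: {t}"),
    Phase("execute",  "glm-5.1",                lambda t: f"Apply the plan as a patch for: {t}"),
    Phase("review",   "minimax-m2.7-highspeed", lambda t: f"Review the patch for: {t}"),
]

def run_pipeline(task: str, call_model: Callable[[str, str], str]) -> dict[str, str]:
    """Run each phase in order; call_model(model, prompt) -> str is the LLM backend."""
    transcript: dict[str, str] = {}
    for phase in PHASES:
        transcript[phase.name] = call_model(phase.model, phase.prompt(task))
    return transcript

# Stub backend for demonstration; a real harness would call a model API here.
result = run_pipeline("fix failing test in utils.py",
                      lambda model, prompt: f"[{model}] {prompt}")
```

The point of the pattern is that critique and review run against a different model than the one that wrote the plan, which is the structural difference from one-shot execution that the post credits for the result.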
// TAGS
megaplan · hermes-megaplan · ai-coding · agent · benchmark · open-source

DISCOVERED

8d ago

2026-04-04

PUBLISHED

8d ago

2026-04-04

RELEVANCE

9 / 10

AUTHOR

PetersOdyssey