OPEN_SOURCE
REDDIT // 8d ago · BENCHMARK RESULT
Megaplan harness edges Opus on SWE-bench
Megaplan is a general-purpose planning and execution harness for LLMs, and its live SWE-bench dashboard shows open-weight models running through the harness ahead of Claude Opus 4.5 on the benchmark. At the time I checked, 26 of 500 tasks had been scored, with 21 passes, for an 80.8% pass rate.
// ANALYSIS
This is a harness story more than a model story: the claim is that structured planning, critique, gating, and review can unlock much better coding performance from open models than one-shot execution.
- The live setup uses GLM-5.1 for prep, plan, execute, and review, with MiniMax-M2.7-highspeed handling critique and review, which is a concrete example of phase-specialized orchestration
- The repo frames Megaplan as a reusable workflow layer, not a one-off benchmark script, which makes the result more interesting for agent builders than for raw model rankings
- The result is still early and noisy: 26 scored tasks is a small slice of SWE-bench Verified, so the lead could move as the remaining 474 tasks come in
- The fact that all code and data are public makes this unusually replicable for a leaderboard claim, which should help separate signal from hype
- If the curve holds, this strengthens the case that better agent scaffolding can matter as much as marginal model gains on software engineering tasks
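The phase-specialized orchestration described above can be sketched roughly as a loop that routes each phase to a designated model. This is a minimal illustration, not Megaplan's actual API: the model names, the `call_model` stub, and the fixed revision count are all hypothetical.

```python
# Hypothetical sketch of phase-specialized orchestration: different models
# handle different phases of a plan -> execute -> critique -> review loop.
# Model names and call_model are illustrative stand-ins, not Megaplan code.

PHASE_MODELS = {
    "plan": "planner-model",     # a strong general model drafts the plan
    "execute": "planner-model",  # the same model writes the patch
    "critique": "critic-model",  # a faster model critiques intermediate output
    "review": "critic-model",    # and gates the final answer
}

def call_model(model: str, phase: str, payload: str) -> str:
    """Stand-in for a real LLM call; returns a tagged string for demonstration."""
    return f"[{model}:{phase}] {payload}"

def run_task(task: str, max_rounds: int = 2) -> str:
    plan = call_model(PHASE_MODELS["plan"], "plan", task)
    draft = call_model(PHASE_MODELS["execute"], "execute", plan)
    for _ in range(max_rounds):
        critique = call_model(PHASE_MODELS["critique"], "critique", draft)
        # Gate: a real harness would let the critic decide whether to iterate;
        # here we always run a fixed number of revision rounds for illustration.
        draft = call_model(PHASE_MODELS["execute"], "execute", critique)
    return call_model(PHASE_MODELS["review"], "review", draft)

print(run_task("fix failing test in repo"))
```

The point of the pattern is that planning and execution can use a slower, stronger model while critique and review use a cheaper, faster one, so the gating steps stay affordable even when they run on every iteration.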
// TAGS
megaplan · hermes-megaplan · ai-coding · agent · benchmark · open-source
DISCOVERED
8d ago
2026-04-04
PUBLISHED
8d ago
2026-04-04
RELEVANCE
9/10
AUTHOR
PetersOdyssey