Independent benchmarks question MiniMax M2.7's self-improvement claims
A user evaluated the newly released MiniMax M2.7 model against older models using "The Multivac," a blind peer-evaluation system. While single-turn Q&A results showed M2.7 trailing GPT-5.4 and tying the older M1, the author acknowledges that the vendor's 30% self-improvement claims apply to multi-turn workflows and is seeking better evaluation frameworks.
The discrepancy between vendor claims and independent benchmarks highlights the challenges of evaluating multi-turn, agentic models with traditional single-turn evals.
* M2.7 scored an average of 8.46 across 13 evaluations, virtually tying the older M1 model (8.47) and falling well below GPT-5.4 (9.26).
* On targeted self-improvement tests, M2.7 tied for 1st in recursive optimization but fell short in iterative code improvement and debugging reasoning chains.
* The evaluator used external frontier judges (Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro) to reduce noise from same-family judging.
* Acknowledging the limitations of single-turn testing for an agentic model, the author is seeking community input on designing appropriate multi-turn evaluation harnesses; a minimal sketch of what such a harness might look like follows this list.
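The post does not describe The Multivac's internals, but the setup it reports (an anonymized transcript scored by a panel of external frontier judges) can be sketched as follows, here extended to a scripted multi-turn task. The model identifiers, the `call_model()` client, and the scoring rubric are illustrative assumptions, not the author's actual code.

```python
# Hypothetical sketch of a blind, multi-judge, multi-turn evaluation harness.
# Model names, call_model(), and the rubric are assumptions for illustration.
from statistics import mean

JUDGES = ["claude-sonnet-4.6", "gpt-5.4", "gemini-3.1-pro"]  # external judge panel

def call_model(model: str, messages: list[dict]) -> str:
    """Stub for an LLM API client; replace with a real SDK call."""
    raise NotImplementedError

def run_multi_turn(candidate: str, turns: list[str]) -> list[dict]:
    """Drive the candidate model through a scripted multi-turn task."""
    history: list[dict] = []
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        reply = call_model(candidate, history)
        history.append({"role": "assistant", "content": reply})
    return history

def judge_transcript(transcript: list[dict]) -> float:
    """Score one anonymized transcript with each judge and average the scores.

    Judges never see which model produced the transcript (blind judging),
    and none of them belong to the candidate's model family.
    """
    rendered = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
    rubric = (
        "Rate the assistant's performance on this task from 1 to 10. "
        "Reply with only the number.\n\n" + rendered
    )
    scores = [
        float(call_model(judge, [{"role": "user", "content": rubric}]).strip())
        for judge in JUDGES
    ]
    return mean(scores)

if __name__ == "__main__":
    task = [
        "Write a function that parses ISO 8601 dates.",
        "Now improve it: handle timezones and add tests.",
    ]
    for candidate in ["minimax-m2.7", "minimax-m1"]:
        transcript = run_multi_turn(candidate, task)
        print(candidate, judge_transcript(transcript))
```

A real harness would add randomized task ordering, transcript anonymization before judging, and per-category aggregation, but the core loop of "run the multi-turn task, then score the full transcript with an external judge panel" is the part single-turn evals miss.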
DISCOVERED: 2026-03-22
PUBLISHED: 2026-03-21
AUTHOR: Silver_Raspberry_811