OPEN_SOURCE
REDDIT // 19d ago // BENCHMARK RESULT
LLM Debate Benchmark puts Sonnet 4.6 first
Lech Mazur's benchmark runs 10-turn debates twice with sides swapped, then ranks models with Bradley-Terry over judged matchups. In the current snapshot, Claude Sonnet 4.6 (high reasoning) leads overall, GLM-5 is the top open-weights model, and Xiaomi MiMo V2 Pro is the clearest content-block outlier.
// ANALYSIS
This is one of the cleaner "debate as eval" designs out there, because it controls for side bias and scores full argument arcs instead of one-shot answers. It still measures a very specific skill, though: sustained adversarial persuasion on contentious topics.
- Side swapping is the big methodological win; it keeps the leaderboard from rewarding whichever side the model happened to get.
- Bradley-Terry over paired outcomes is a better fit than raw judge averages, especially when the judges are themselves LLMs.
- The frontier is crowded, so Sonnet 4.6's lead reads more like a narrow current edge than a permanent gap.
- GLM-5 leading the open-weights pack is the practical signal for teams that want strong debate behavior without closed-model dependency.
- MiMo V2 Pro's 10.4% content-block rate shows that operational fragility can matter as much as rhetorical strength.
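To make the side-swap and Bradley-Terry points concrete, here is a minimal Python sketch of how paired debate outcomes could be turned into a ranking. It is not the benchmark's actual code: the function names (`tally_side_swapped`, `fit_bradley_terry`), the `(pro, con, winner)` record shape, and the toy data are all assumptions for illustration, and the fit uses the standard minorization-maximization update without ties, judge weighting, or regularization.

```python
from collections import defaultdict


def tally_side_swapped(debates):
    """Count wins from (pro_model, con_model, winner) records.

    Hypothetical record shape: each pairing on a topic is assumed to appear
    twice, once with each model arguing each side.
    """
    wins = defaultdict(float)
    for pro, con, winner in debates:
        loser = con if winner == pro else pro
        wins[(winner, loser)] += 1.0
    return dict(wins)


def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[(a, b)] is the number of times a beat b. Strengths are scaled to
    average 1. Every model is assumed to have at least one win.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # games played per unordered pair
    for (a, b), w in wins.items():
        total_wins[a] += w
        games[frozenset((a, b))] += w

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = total_wins[i] / denom if denom else strength[i]
        scale = len(models) / sum(updated.values())
        updated = {m: s * scale for m, s in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy case: the "pro" side wins both debates, so the result is pure side bias.
biased = [("A", "B", "A"), ("B", "A", "B")]
print(fit_bradley_terry(tally_side_swapped(biased)))
```

In the toy case each model wins once, so the fitted strengths come out equal: a judge that simply favors one side cannot move either model up the table. Under the Bradley-Terry model, the probability that model i beats model j is strength[i] / (strength[i] + strength[j]), which is why the fit uses who beat whom rather than averaging raw judge scores.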
// TAGS
llm-debate-benchmark · benchmark · llm · reasoning · open-weights · open-source · research
DISCOVERED
19d ago
2026-03-23
PUBLISHED
19d ago
2026-03-23
RELEVANCE
9/10
AUTHOR
zero0_one1