LLM Debate Benchmark puts Sonnet 4.6 first

// BENCHMARK RESULT (65d ago)

Lech Mazur's benchmark runs each 10-turn debate twice, with sides swapped, then ranks models with a Bradley-Terry model fit to the judged matchups. In the current snapshot, Claude Sonnet 4.6 (high reasoning) leads overall, GLM-5 is the top open-weights model, and Xiaomi MiMo V2 Pro is the clearest outlier on content blocks, at a 10.4% rate.
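
To make the side-swapping idea concrete, here is a minimal Python sketch of how such a schedule and win tally could look. It is an illustration under assumptions, not the benchmark's actual code: the function names, the "pro"/"con" side labels, and the placeholder model and topic names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def side_swapped_schedule(models, topics):
    """Yield (topic, pro_model, con_model) so that every pair of models
    debates every topic twice, once on each side.
    Hypothetical helper; the real benchmark's scheduling may differ."""
    for first, second in combinations(models, 2):
        for topic in topics:
            yield (topic, first, second)
            yield (topic, second, first)  # same topic, sides swapped


def tally_wins(results):
    """Count wins per ordered (winner, loser) pair.
    `results` holds (pro_model, con_model, winning_side) tuples,
    where winning_side is "pro" or "con" (assumed labels)."""
    wins = Counter()
    for pro_model, con_model, winning_side in results:
        if winning_side == "pro":
            wins[(pro_model, con_model)] += 1
        else:
            wins[(con_model, pro_model)] += 1
    return wins


# A judge that always favors the "pro" side produces a 1-1 split,
# so the side advantage cancels out instead of rewarding either model.
schedule = list(side_swapped_schedule(["model_a", "model_b"], ["topic_1"]))
biased_results = [(pro, con, "pro") for _, pro, con in schedule]
print(tally_wins(biased_results))
```

The point of the toy run at the end is that a purely side-driven judge cannot move either model ahead once both orderings are played.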

// ANALYSIS

This is one of the cleaner "debate as eval" designs out there, because it controls for side bias and scores full argument arcs instead of one-shot answers. It still measures a very specific skill, though: sustained adversarial persuasion on contentious topics.

– Side swapping is the big methodological win; it keeps the leaderboard from rewarding a model for the side it happened to draw.
– Bradley-Terry over paired outcomes is a better fit than raw judge averages, especially when the judges are themselves LLMs.
– The frontier is crowded, so Sonnet 4.6's lead reads more like a narrow current edge than a permanent gap.
– GLM-5 leading the open-weights pack is the practical signal for teams that want strong debate behavior without depending on a closed model.
– MiMo V2 Pro's 10.4% content-block rate shows that operational fragility can matter as much as rhetorical strength.
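
For readers unfamiliar with Bradley-Terry, the sketch below fits strengths from pairwise win counts using the standard iterative (minorization-maximization) update. It is a plain maximum-likelihood version with no tie handling or regularization; how the benchmark itself fits the model is not stated here, and the model names and win counts are invented purely for illustration.

```python
def fit_bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from {(winner, loser): count}.
    Returns {model: strength}, normalized to sum to 1.
    Assumes every model has at least one win and one loss."""
    models = sorted({name for pair in wins for name in pair})
    strength = {name: 1.0 for name in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(
                count for (winner, loser), count in wins.items() if winner == i
            )
            denominator = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denominator += games / (strength[i] + strength[j])
            updated[i] = total_wins / denominator if denominator else strength[i]
        norm = sum(updated.values())
        strength = {name: value / norm for name, value in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modeled probability that `a` beats `b`."""
    return strength[a] / (strength[a] + strength[b])


# Invented counts for three placeholder models (not benchmark data).
example_wins = {
    ("model_a", "model_b"): 6, ("model_b", "model_a"): 4,
    ("model_a", "model_c"): 7, ("model_c", "model_a"): 3,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
}
strengths = fit_bradley_terry(example_wins)
for name in sorted(strengths, key=strengths.get, reverse=True):
    print(name, round(strengths[name], 3))
```

Unlike averaging raw judge scores, the fit only uses who won each matchup, so a judge that scores everyone high or low on an absolute scale does not shift the ranking.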

// TAGS

llm-debate-benchmark, benchmark, llm, reasoning, open-weights, open-source, research

DISCOVERED

65d ago

2026-03-23

PUBLISHED

65d ago

2026-03-23

RELEVANCE

9/10

AUTHOR

zero0_one1
