LLM Debate Benchmark adds GPT-5.5, Grok 4.3
The benchmark's May 4 update expands the public board with GPT-5.5, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, DeepSeek V4 Pro, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, Grok 4.3, and Mistral Medium 3.5 High Reasoning. Opus 4.7 still leads, while GPT-5.5 lands below GPT-5.4 and Grok 4.3 slips behind the older Grok 4.20 reasoning run.
This is a useful reminder that adversarial debate is a different skill from raw chat quality: the leaderboard is compressing in the middle, but the newest names are not automatically the strongest. The side-swapped setup and three-model judging make the result more credible than a single-pass preference test, even if judge agreement is only moderate.
- GPT-5.5 entering below GPT-5.4 is the sharpest signal here: a newer model with a weaker debate showing
- Grok 4.3 underperforming Grok 4.20 Beta 0309 suggests a real regression, not just leaderboard noise
- GLM-5.1, Kimi K2.6, DeepSeek V4 Pro, and Xiaomi MiMo V2.5 Pro all look like solid incremental gains, but not a frontier break
- The benchmark's side-swapped format matters because motion asymmetry can otherwise overstate one model's advantage
- The entertainment diagnostic is a useful secondary lens, but the Bradley-Terry ranking is the metric that should drive the headline
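For readers unfamiliar with how a Bradley-Terry ranking turns pairwise debate outcomes into a leaderboard, here is a minimal sketch using the standard MM (Zermelo) iteration. The win matrix and model count are hypothetical, not the benchmark's actual data; counting both side assignments per pairing mirrors a side-swapped setup.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of debates model i won against model j
    (wins from both side assignments are pooled, as in a
    side-swapped format). Returns normalized strength scores.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            # MM update: total wins over the sum of per-opponent
            # comparison counts weighted by current strengths.
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]  # renormalize each pass
    return p

# Hypothetical 3-model example: model 0 wins most of its debates.
wins = [
    [0, 7, 8],
    [3, 0, 5],
    [2, 5, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
```

The key property for a debate board is that the fitted strengths account for opponent quality, so a model that beats strong opponents ranks above one with the same raw win count against weak ones.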
DISCOVERED: 2026-05-05
PUBLISHED: 2026-05-05
AUTHOR: zero0_one1