OPEN_SOURCE
REDDIT · 7d ago · BENCHMARK RESULT
Qwen 3.5 edges Gemma 4 in blind eval
A Reddit user ran a 30-question blind benchmark pitting Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B against one another, with Claude Opus 4.6 as the sole judge scoring answers on an absolute scale. Qwen 3.5 27B won the most questions, while Gemma 4 31B matched the MoE variant (26B-A4B) on average score and looked steadier on communication tasks.
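For readers who want to picture the setup, here is a minimal sketch of a single-judge, absolute-scoring harness. The endpoint URL, model identifiers, 0-10 rubric, and judge prompt are all assumptions for illustration; only the judge role of Claude Opus 4.6 and the three candidate models come from the post.

```python
# Sketch of a single-judge, absolute-scoring benchmark harness.
# ASSUMPTIONS: an OpenAI-compatible /v1/chat/completions endpoint,
# a 0-10 rubric, and these model id strings; none are from the post.
import json
import re
import urllib.request

API = "http://localhost:8080/v1/chat/completions"  # hypothetical endpoint

def chat(model: str, prompt: str) -> str:
    """Send one user message to `model` and return its reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        API, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

JUDGE = "claude-opus-4-6"
CANDIDATES = ["qwen-3.5-27b", "gemma-4-31b", "gemma-4-26b-a4b"]

def judge_answer(question: str, answer: str) -> float:
    # Absolute scoring: the judge sees one anonymous answer at a time,
    # which avoids pairwise position bias but not rubric drift.
    prompt = (
        "Score this answer from 0 to 10 against the question. "
        f"Reply with only a number.\n\nQ: {question}\n\nA: {answer}"
    )
    match = re.search(r"\d+(?:\.\d+)?", chat(JUDGE, prompt))
    return float(match.group()) if match else 0.0

def run_benchmark(questions: list[str]) -> dict[str, list[float]]:
    """Collect a per-model score list; the model name is hidden from the judge."""
    scores: dict[str, list[float]] = {m: [] for m in CANDIDATES}
    for q in questions:
        for model in CANDIDATES:
            scores[model].append(judge_answer(q, chat(model, q)))
    return scores
```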
// ANALYSIS
The signal is real but messy: Qwen looks strongest on raw capability when it stays on rails, Gemma 4 looks more consistent, and the MoE variant's results suggest that reliability, not peak score, may be the bigger product gap.
- Qwen 3.5 27B took 14 of 30 question wins, especially in reasoning and analysis, but three 0.0 scores make the average hard to trust without a failure-rate lens (see the sketch after this list).
- Gemma 4 31B's 8.82 average matched Gemma 4 26B-A4B exactly, which is a good sign for the MoE variant when it doesn't error out.
- The MoE model's 2 outright failures are a bigger practical issue than its average score suggests, especially for local users who care about completion rate.
- The long Gemma 4 31B latency spikes are notable because they did not obviously buy higher scores, which points to inference efficiency rather than pure capability as the next optimization target.
- Single-judge absolute scoring reduces some pairwise bias, but it still leaves rubric drift, verbosity effects, and judge personality as unresolved confounders.
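To make the failure-rate point concrete, the sketch below splits a raw average into completion rate and quality-when-completed. The score lists are illustrative only; the post reports aggregates (an 8.82 average, three 0.0s for Qwen, 2 MoE failures), not per-question data.

```python
# Why a raw mean hides failures: a spiky model can beat a steady one
# on completed questions while being worse on completion rate.
# Scores below are ILLUSTRATIVE, not the post's per-question data.
from statistics import mean

def failure_lens(scores: list[float], fail_threshold: float = 0.0) -> dict:
    """Split a score list into overall mean, completion rate, and
    mean over completed (non-failed) questions."""
    completed = [s for s in scores if s > fail_threshold]
    return {
        "raw_mean": round(mean(scores), 2),
        "completion_rate": round(len(completed) / len(scores), 2),
        "mean_when_completed": round(mean(completed), 2) if completed else 0.0,
    }

spiky  = [9.5, 9.5, 0.0, 9.0, 0.0, 9.5, 0.0, 9.0, 9.5, 9.0]
steady = [8.8, 8.9, 8.7, 8.8, 8.9, 8.8, 8.7, 8.9, 8.8, 8.8]
print("spiky :", failure_lens(spiky))   # raw_mean 6.5, but 9.29 when it completes
print("steady:", failure_lens(steady))  # raw_mean 8.81 at 100% completion
```

Reporting both numbers side by side is what the "failure-rate lens" amounts to: the spiky model looks worse on raw mean yet stronger on completed questions, and only the completion rate tells a local user which trade-off they are buying.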
// TAGS
benchmark · llm · reasoning · gemma-4 · qwen-3-5 · claude-opus-4-6
DISCOVERED
2026-04-05
PUBLISHED
2026-04-05
RELEVANCE
8/10
AUTHOR
Silver_Raspberry_811