REDDIT · 7d ago · BENCHMARK RESULT

Qwen 3.5 edges Gemma 4 in blind eval

A Reddit user ran a 30-question blind, single-judge benchmark pitting Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B against each other with Claude Opus 4.6 as judge. Qwen won the most questions, while Gemma 4 31B matched the MoE variant on average score and looked steadier on communication tasks.

// ANALYSIS

The signal is real but messy: Qwen looks strongest on raw capability when it stays on rails, Gemma 4 looks more consistent, and the MoE variant's failures suggest that reliability, not peak score, may be the bigger product gap.

  • Qwen 3.5 27B took 14 of 30 question wins, especially in reasoning and analysis, but three 0.0 scores make the average hard to trust without a failure-rate lens.
  • Gemma 4 31B’s 8.82 average matched Gemma 4 26B-A4B exactly, which is a good sign for the MoE variant when it doesn’t error out.
  • The MoE model’s two outright failures are a bigger practical issue than its average score suggests, especially for local users who care about completion rate.
  • The long Gemma 4 31B latency spikes are notable because they did not obviously buy higher scores, which points to inference efficiency rather than pure capability as the next optimization target.
  • Single-judge absolute scoring reduces some pairwise bias, but it still leaves rubric drift, verbosity effects, and judge personality as unresolved confounders.
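The failure-rate lens the bullets argue for can be made concrete with a small sketch. The numbers below are hypothetical ten-question runs, not the actual Reddit data, and `summarize` is an illustrative helper, not part of the benchmark: it shows how a model with two hard 0.0 failures can post a lower raw mean than a steadier model despite scoring higher on every question it completes.

```python
from statistics import mean

def summarize(scores, fail_threshold=0.0):
    """Summarize a per-question score list (0-10 scale) with a
    failure-rate lens: raw mean, failure count as a rate, and the
    mean over completed (non-failed) questions only."""
    failures = [s for s in scores if s <= fail_threshold]
    completed = [s for s in scores if s > fail_threshold]
    return {
        "mean": round(mean(scores), 2),
        "failure_rate": len(failures) / len(scores),
        "mean_when_completed": round(mean(completed), 2) if completed else None,
    }

# Hypothetical runs (illustrative only):
steady = [8.8] * 10              # consistent model, no failures
spiky = [9.8] * 8 + [0.0, 0.0]   # higher peaks, two outright failures

print(summarize(steady))  # raw mean 8.8, failure rate 0.0
print(summarize(spiky))   # raw mean 7.84, failure rate 0.2, 9.8 when completed
```

Reporting both numbers side by side is what makes the "average hard to trust" point visible: the spiky model's raw mean understates its capability and hides its completion-rate problem at the same time.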
// TAGS
benchmark · llm · reasoning · gemma-4 · qwen-3-5 · claude-opus-4-6

DISCOVERED

2026-04-05

PUBLISHED

2026-04-05

RELEVANCE

8/10

AUTHOR

Silver_Raspberry_811