OPEN_SOURCE
REDDIT · 7d ago · BENCHMARK RESULT
Qwen 3.5 edges Gemma 4 in blind eval
A Reddit user ran a 30-question blind benchmark pitting Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B against one another, with Claude Opus 4.6 as the sole judge scoring answers on an absolute scale. Qwen 3.5 27B won the most questions, while Gemma 4 31B matched the MoE variant (26B-A4B) on average score and looked steadier on communication tasks.
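For readers who want to picture the setup, here is a minimal sketch of a single-judge, absolute-scoring harness. The endpoint URL, model identifiers, 0-10 rubric, and judge prompt are all assumptions for illustration; only the judge role of Claude Opus 4.6 and the three candidate models come from the post.

```python
# Sketch of a single-judge, absolute-scoring benchmark harness.
# ASSUMPTIONS: an OpenAI-compatible /v1/chat/completions endpoint,
# a 0-10 rubric, and these model id strings; none are from the post.
import json
import re
import urllib.request

API = "http://localhost:8080/v1/chat/completions"  # hypothetical endpoint

def chat(model: str, prompt: str) -> str:
    """Send one user message to `model` and return its reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        API, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

JUDGE = "claude-opus-4-6"
CANDIDATES = ["qwen-3.5-27b", "gemma-4-31b", "gemma-4-26b-a4b"]

def judge_answer(question: str, answer: str) -> float:
    # Absolute scoring: the judge sees one anonymous answer at a time,
    # which avoids pairwise position bias but not rubric drift.
    prompt = (
        "Score this answer from 0 to 10 against the question. "
        f"Reply with only a number.\n\nQ: {question}\n\nA: {answer}"
    )
    match = re.search(r"\d+(?:\.\d+)?", chat(JUDGE, prompt))
    return float(match.group()) if match else 0.0

def run_benchmark(questions: list[str]) -> dict[str, list[float]]:
    """Collect a per-model score list; the model name is hidden from the judge."""
    scores: dict[str, list[float]] = {m: [] for m in CANDIDATES}
    for q in questions:
        for model in CANDIDATES:
            scores[model].append(judge_answer(q, chat(model, q)))
    return scores
```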
// ANALYSIS
The signal is real but messy: Qwen looks strongest on raw capability when it stays on rails, Gemma 4 looks more consistent, and the MoE variant's results suggest that reliability, not peak score, may be the bigger product gap.
- Qwen 3.5 27B took 14 of 30 question wins, especially in reasoning and analysis, but three 0.0 scores make the average hard to trust without a failure-rate lens (see the sketch after this list).
- Gemma 4 31B's 8.82 average matched Gemma 4 26B-A4B exactly, which is a good sign for the MoE variant when it doesn't error out.
- The MoE model's 2 outright failures are a bigger practical issue than its average score suggests, especially for local users who care about completion rate.
- The long Gemma 4 31B latency spikes are notable because they did not obviously buy higher scores, which points to inference efficiency rather than pure capability as the next optimization target.
- Single-judge absolute scoring reduces some pairwise bias, but it still leaves rubric drift, verbosity effects, and judge personality as unresolved confounders.
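To make the failure-rate point concrete, the sketch below splits a raw average into completion rate and quality-when-completed. The score lists are illustrative only; the post reports aggregates (an 8.82 average, three 0.0s for Qwen, 2 MoE failures), not per-question data.

```python
# Why a raw mean hides failures: a spiky model can beat a steady one
# on completed questions while being worse on completion rate.
# Scores below are ILLUSTRATIVE, not the post's per-question data.
from statistics import mean

def failure_lens(scores: list[float], fail_threshold: float = 0.0) -> dict:
    """Split a score list into overall mean, completion rate, and
    mean over completed (non-failed) questions."""
    completed = [s for s in scores if s > fail_threshold]
    return {
        "raw_mean": round(mean(scores), 2),
        "completion_rate": round(len(completed) / len(scores), 2),
        "mean_when_completed": round(mean(completed), 2) if completed else 0.0,
    }

spiky  = [9.5, 9.5, 0.0, 9.0, 0.0, 9.5, 0.0, 9.0, 9.5, 9.0]
steady = [8.8, 8.9, 8.7, 8.8, 8.9, 8.8, 8.7, 8.9, 8.8, 8.8]
print("spiky :", failure_lens(spiky))   # raw_mean 6.5, but 9.29 when it completes
print("steady:", failure_lens(steady))  # raw_mean 8.81 at 100% completion
```

Reporting both numbers side by side is what the "failure-rate lens" amounts to: the spiky model looks worse on raw mean yet stronger on completed questions, and only the completion rate tells a local user which trade-off they are buying.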
// TAGS
benchmark · llm · reasoning · gemma-4 · qwen-3-5 · claude-opus-4-6
DISCOVERED
2026-04-05
PUBLISHED
2026-04-05
RELEVANCE
8/10
AUTHOR
Silver_Raspberry_811