Qwen 3 32B tops Qwen 3.5 blind evals
REDDIT · 26d ago · BENCHMARK RESULT


A community benchmark in r/LocalLLaMA tested eight Qwen 3 and 3.5 models across 11 reasoning and coding tasks. Qwen 3 32B took the top average score, while Qwen 3.5 35B-A3B matched the flagship's win count at far lower latency. The author flags key caveats: only 412 of 704 judgments were valid, judge strictness differed between model generations, and Qwen 3 32B appeared in only 6 of the 11 evals due to API failures.

// ANALYSIS

Hot take: this is less a clean "dense beats MoE" verdict and more a reminder that eval design and latency constraints can outweigh raw model scale in local workflows.

  • Qwen 3 32B leading despite partial coverage suggests real strength, but the five missing evals could skew the final ordering.
  • Qwen 3.5 35B-A3B is the operational standout for local users, combining near-top quality with much stronger score-per-second.
  • Qwen 3 Coder Next losing coding-heavy tasks to general models supports a broader trend that specialized labels do not always translate to better real-world debugging performance.
  • With a 41.5% invalid-judgment rate and clear judge calibration drift between generations, ranks 3 to 5 should be treated as within noise unless replicated.
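The 41.5% figure follows directly from the judgment counts in the post (412 valid of 704 total). A minimal sketch of that arithmetic, plus the per-model sample size it implies (the eight-model split is from the source; the even-split assumption is ours):

```python
# Judgment-validity arithmetic from the post's reported counts.
valid, total = 412, 704
invalid_rate = 1 - valid / total
print(f"invalid judgments: {invalid_rate:.1%}")  # → 41.5%

# Rough per-model sample size, assuming valid judgments were spread
# evenly across the eight models tested (an assumption, not a figure
# from the post) — small enough that mid-table ranks are noisy.
models = 8
per_model = valid / models
print(f"~{per_model:.0f} valid judgments per model")  # → ~52
```

With only on the order of fifty valid judgments per model, a few flipped verdicts can reorder adjacent ranks, which is why the analysis treats ranks 3 to 5 as within noise absent replication.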
// TAGS
qwen3 · llm · benchmark · reasoning · ai-coding · open-weights · inference

DISCOVERED

26d ago

2026-03-17

PUBLISHED

26d ago

2026-03-17

RELEVANCE

8/10

AUTHOR

Silver_Raspberry_811