Qwen 3 32B tops Qwen 3.5 blind evals

// 71d ago · BENCHMARK RESULT

A community benchmark posted to r/LocalLLaMA tested eight Qwen 3 and Qwen 3.5 models across 11 reasoning and coding tasks. Qwen 3 32B took the top average score, while Qwen 3.5 35B-A3B matched the flagship in wins and ran much faster. The author also flags key caveats: only 412 of 704 judgments were valid, judge strictness differed by model generation, and Qwen 3 32B appeared in only 6 of 11 evals because of API failures.
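To put those caveats in proportion, here is a minimal Python sketch of the arithmetic. The four counts come straight from the summary above; the helper name `pct` and the variable names are only illustrative.

```python
# Counts quoted in the summary above.
TOTAL_JUDGMENTS = 704
VALID_JUDGMENTS = 412
EVALS_TOTAL = 11
EVALS_WITH_QWEN3_32B = 6


def pct(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


# Share of judgments that had to be discarded.
invalid_rate = pct(TOTAL_JUDGMENTS - VALID_JUDGMENTS, TOTAL_JUDGMENTS)

# Share of evals in which Qwen 3 32B actually produced a scored answer.
coverage = pct(EVALS_WITH_QWEN3_32B, EVALS_TOTAL)

print(f"invalid judgments: {invalid_rate}%")   # 41.5%
print(f"Qwen 3 32B eval coverage: {coverage}%")  # 54.5%
```

In other words, roughly two in five judgments were thrown out, and the top-ranked model was scored on just over half of the tasks.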

// ANALYSIS

Hot take: this is less a clean "dense beats MoE" verdict and more a reminder that eval design and latency constraints can outweigh raw model scale in local workflows.

  • Qwen 3 32B leading despite appearing in only 6 of 11 evals suggests real strength, but the 5 missing evals could skew the final ordering.
  • Qwen 3.5 35B-A3B is the operational standout for local users, combining near-top quality with much stronger score-per-second.
  • Qwen 3 Coder Next losing coding-heavy tasks to general models fits a broader pattern: a specialized label does not always translate into better real-world debugging performance.
  • With a 41.5% invalid-judgment rate and clear judge calibration drift between generations, ranks 3 to 5 should be treated as within noise unless replicated.
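The partial-coverage concern can be illustrated with a small Python sketch. The scores below are invented purely for illustration and are not taken from the benchmark; the function name `mean_on_shared_evals` and the eval labels are likewise hypothetical. The idea is simply to compare two models only on the evals both of them completed, rather than on each model's own average.

```python
from statistics import mean


def mean_on_shared_evals(scores_a: dict, scores_b: dict):
    """Average two models only over the evals where both have a score."""
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("the two models share no evals")
    avg_a = mean(scores_a[e] for e in shared)
    avg_b = mean(scores_b[e] for e in shared)
    return avg_a, avg_b, shared


# HYPOTHETICAL scores, not benchmark data: one model ran every eval,
# the other only ran the two evals on which everyone scored well.
full_coverage = {"e1": 9.0, "e2": 8.0, "e3": 5.0, "e4": 4.0}
partial_coverage = {"e1": 9.0, "e2": 8.5}

naive_gap = mean(partial_coverage.values()) - mean(full_coverage.values())
avg_full, avg_partial, shared = mean_on_shared_evals(full_coverage, partial_coverage)
shared_gap = avg_partial - avg_full

print(f"naive gap: {naive_gap}")    # 2.25
print(f"shared gap: {shared_gap}")  # 0.25
```

With these made-up numbers, the apparent lead shrinks from 2.25 to 0.25 once both models are compared on the same evals, which is why a ranking built on uneven coverage deserves caution.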
// TAGS
qwen3 · llm · benchmark · reasoning · ai-coding · open-weights · inference

DISCOVERED

71d ago

2026-03-17

PUBLISHED

71d ago

2026-03-17

RELEVANCE

8/10

AUTHOR

Silver_Raspberry_811