Qwen 3 32B tops Qwen 3.5 blind evals
REDDIT · 26d ago · BENCHMARK RESULT


A community benchmark in r/LocalLLaMA tested eight Qwen 3 and 3.5 models across 11 reasoning and coding tasks. Qwen 3 32B took the top average score, while Qwen 3.5 35B-A3B matched the flagship's win count at far lower latency. The author flags key caveats: only 412 of 704 judgments were valid, judge strictness differed between model generations, and Qwen 3 32B appeared in only 6 of the 11 evals due to API failures.

// ANALYSIS

Hot take: this is less a clean "dense beats MoE" verdict and more a reminder that eval design and latency constraints can outweigh raw model scale in local workflows.

  • Qwen 3 32B leading despite partial coverage suggests real strength, but the five missing evals could skew the final ordering.
  • Qwen 3.5 35B-A3B is the operational standout for local users, combining near-top quality with much stronger score-per-second.
  • Qwen 3 Coder Next losing coding-heavy tasks to general models supports a broader trend that specialized labels do not always translate to better real-world debugging performance.
  • With a 41.5% invalid-judgment rate and clear judge calibration drift between generations, ranks 3 to 5 should be treated as within noise unless replicated.
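The 41.5% figure follows directly from the judgment counts in the post (412 valid of 704 total). A minimal sketch of that arithmetic, plus the per-model sample size it implies (the eight-model split is from the source; the even-split assumption is ours):

```python
# Judgment-validity arithmetic from the post's reported counts.
valid, total = 412, 704
invalid_rate = 1 - valid / total
print(f"invalid judgments: {invalid_rate:.1%}")  # → 41.5%

# Rough per-model sample size, assuming valid judgments were spread
# evenly across the eight models tested (an assumption, not a figure
# from the post) — small enough that mid-table ranks are noisy.
models = 8
per_model = valid / models
print(f"~{per_model:.0f} valid judgments per model")  # → ~52
```

With only on the order of fifty valid judgments per model, a few flipped verdicts can reorder adjacent ranks, which is why the analysis treats ranks 3 to 5 as within noise absent replication.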
// TAGS
qwen3 · llm · benchmark · reasoning · ai-coding · open-weights · inference

DISCOVERED

26d ago

2026-03-17

PUBLISHED

26d ago

2026-03-17

RELEVANCE

8/10

AUTHOR

Silver_Raspberry_811