Qwen3.6-35B-A3B crowns 12GB sweet spot
A Reddit benchmark on an RTX 3060 12GB shows Qwen3.6-35B-A3B is surprisingly practical locally, especially with a tuned `-ncmoe` (MoE CPU-offload) setting and a q8 KV cache. The author reports ~46-47 tok/s decode and says 32k context is usable without falling off the VRAM cliff.
The interesting part is not the raw speed; it's that a 35B MoE model crosses from "theoretical" to "daily-driver" territory on 12GB if you tune offload carefully.
- The sweet spot appears to be `-ncmoe 18-20`; dropping to `16` (keeping more expert layers on the GPU) triggers a sharp performance cliff
- The q8 KV cache is effectively free here, so the usual memory-vs-speed tradeoff leans toward quantizing the cache
- Plain decoding already lands around 46 tok/s, which makes MTP only a marginal upgrade in this setup
- The practical win is context: 16k to 32k feels achievable instead of being a benchmark-only configuration
- For local coding, this is a better signal than headline benchmark charts because it reflects the real constraint: VRAM, not just FLOPS
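The tuning described above maps onto llama.cpp's server flags roughly like this; a sketch, where the post's `-ncmoe` is shorthand for llama.cpp's `--n-cpu-moe`, and the model filename and quant level are assumptions not stated in the post:

```shell
# Hypothetical llama.cpp invocation approximating the reported setup.
# --n-cpu-moe keeps N MoE expert layers on the CPU (the post's -ncmoe);
# 18-20 is the reported sweet spot on a 12GB RTX 3060.
# --cache-type-k/v q8_0 enables the q8 KV cache the post calls "free".
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 18 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 32768
```

Lowering `--n-cpu-moe` moves more expert layers onto the GPU; per the post, going below ~18 overflows 12GB of VRAM and performance falls off a cliff.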
DISCOVERED: 2026-05-09
PUBLISHED: 2026-05-08
AUTHOR: jwestra
