Qwen3.6-35B-A3B hits 130 t/s on consumer GPUs
Alibaba's sparse MoE model, Qwen3.6-35B-A3B, is seeing rapid local adoption as developers tune it for consumer hardware, with reported inference speeds of up to 130 tokens per second on an RTX 3090. Its efficiency and strong coding performance are setting a new standard for open-weight models.
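For anyone wanting to sanity-check a throughput figure like that, here is a minimal benchmarking sketch using the llama-cpp-python bindings. The GGUF file name is hypothetical, and it assumes an IQ4 quantization of the model that fits in 24 GB of VRAM:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical file name; whoever publishes the IQ4 GGUF sets the real one.
llm = Llama(
    model_path="Qwen3.6-35B-A3B-IQ4_XS.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU (e.g. an RTX 3090)
    n_ctx=32768,      # the model advertises 262144, but the KV cache
                      # limits how much context fits in 24 GB of VRAM
)

prompt = "Write a Python function that merges two sorted lists."
start = time.perf_counter()
out = llm(prompt, max_tokens=512)
elapsed = time.perf_counter() - start

# Rough figure: this timing includes prompt processing as well as decoding.
n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated / elapsed:.1f} tokens/sec")
```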
Qwen3.6-35B-A3B's sparse MoE architecture activates only 3B parameters per token, delivering 35B-class reasoning at speeds previously reserved for much smaller models. For local hardware, IQ4 quantization currently offers the best tradeoff between speed and reasoning accuracy, and specialized coding presets from Unsloth are reported to add a further 10-15 t/s. A 262K native context window and a 73.4% SWE-bench score position it as a formidable local competitor to cloud models.
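To make the 3B-active-of-35B claim concrete: in a sparse MoE layer, a router scores every expert and runs only the top-k for each token. A toy NumPy sketch of that routing step, with illustrative sizes rather than the model's real configuration:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through only the top-k experts.

    x:        (d,) token hidden state
    router_w: (n_experts, d) router weights
    experts:  list of n_experts callables, each a small FFN
    """
    logits = router_w @ x          # score every expert
    top = np.argsort(logits)[-k:]  # keep the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()       # softmax over the chosen k only
    # Only k expert FFNs actually execute; the rest of the parameters
    # stay idle, so per-token compute tracks the active-parameter
    # count (~3B) rather than the total (35B).
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 8 experts, d=16, each expert a random linear map.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), router_w, experts)
print(y.shape)  # (16,)
```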
DISCOVERED: 2026-04-19 (7h ago)
PUBLISHED: 2026-04-19 (8h ago)
AUTHOR: cviperr33