Q4_K_XL edges Q4_K_M on Qwen3.6-35B-A3B
In a local inference test on an 8GB VRAM plus 32GB RAM machine, the Unsloth Q4_K_XL quant for Qwen3.6-35B-A3B came out a little ahead of Q4_K_M on speed, while also producing fewer output tokens on average. The posted five-run benchmark shows Q4_K_XL at 29.78 avg tokens/sec versus 28.92 for Q4_K_M, with 99.93s avg wall time versus 108.03s, even though Q4_K_XL used more memory. The author notes the first run includes startup/init time, so the comparison is realistic for an on/off local workflow rather than a pure warm-cache microbenchmark.
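The described methodology (five timed runs, with the first run intentionally including startup/init) can be sketched as a small harness. This is an illustrative sketch, not the author's actual script; `fake_generate` is a stand-in for whatever actually drives the model (e.g. a llama.cpp invocation), and it is assumed to return the number of output tokens produced.

```python
import time
from statistics import mean

def benchmark(generate, prompt, runs=5):
    """Time `runs` generations back to back. The first call deliberately
    absorbs any model startup/init cost, mirroring the posted methodology
    of measuring a realistic on/off local workflow rather than a
    warm-cache microbenchmark."""
    results = []
    for _ in range(runs):
        t0 = time.perf_counter()
        n_tokens = generate(prompt)  # assumed to return output-token count
        dt = time.perf_counter() - t0
        results.append((n_tokens, dt))
    avg_tps = mean(n / s for n, s in results)    # avg tokens/sec
    avg_wall = mean(s for _, s in results)       # avg wall time per run
    avg_tokens = mean(n for n, _ in results)     # avg output length
    return avg_tps, avg_wall, avg_tokens

# Hypothetical stand-in generator, purely for illustration.
def fake_generate(prompt):
    time.sleep(0.01)
    return 30

tps, wall, toks = benchmark(fake_generate, "hello")
```

Reporting all three averages matters here: as the post notes, a quant can win on wall time partly by emitting fewer tokens, so tokens/sec alone does not tell the whole story.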
Hot take: quant size alone does not determine end-to-end speed, especially on sparse MoE models where memory behavior and generation length can matter as much as raw throughput.
- The result is directionally plausible for a 35B MoE model: more memory usage can still buy slightly better throughput if it reduces bottlenecks elsewhere.
- The lower average output token count for Q4_K_XL likely contributes to the better wall-clock result, so this is not a pure apples-to-apples decoding-speed win.
- Including startup time makes the benchmark more representative of casual local use, but it also adds noise that can mask warm-run differences.
- The main takeaway is practical: on constrained local hardware, the “best” quant is not always the smallest one.
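The output-length point can be sanity-checked from the posted averages alone. Multiplying avg tokens/sec by avg wall time gives a rough implied output-token count per run (rough because wall time includes startup, during which no tokens are produced):

```python
# Implied output tokens per run, from the posted five-run averages.
# Rough estimates only: wall time includes startup/init, so the true
# generated-token counts are somewhat lower than these products.
xl_tokens = 29.78 * 99.93    # Q4_K_XL
m_tokens = 28.92 * 108.03    # Q4_K_M

print(round(xl_tokens), round(m_tokens))
```

Q4_K_XL comes out around 2,976 implied tokens versus roughly 3,124 for Q4_K_M, about 5% fewer, which supports the caveat that part of its wall-clock advantage comes from shorter generations rather than faster decoding.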
DISCOVERED: 2026-04-26
PUBLISHED: 2026-04-26
AUTHOR: EggDroppedSoup