Q4_K_XL edges Q4_K_M on Qwen3.6-35B-A3B
In a local inference test on an 8GB VRAM plus 32GB RAM machine, the Unsloth Q4_K_XL quant for Qwen3.6-35B-A3B came out a little ahead of Q4_K_M on speed, while also producing fewer output tokens on average. The posted five-run benchmark shows Q4_K_XL at 29.78 avg tokens/sec versus 28.92 for Q4_K_M, with 99.93s avg wall time versus 108.03s, even though Q4_K_XL used more memory. The author notes the first run includes startup/init time, so the comparison is realistic for an on/off local workflow rather than a pure warm-cache microbenchmark.
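The described methodology (five timed runs, with the first run intentionally including startup/init) can be sketched as a small harness. This is an illustrative sketch, not the author's actual script; `fake_generate` is a stand-in for whatever actually drives the model (e.g. a llama.cpp invocation), and it is assumed to return the number of output tokens produced.

```python
import time
from statistics import mean

def benchmark(generate, prompt, runs=5):
    """Time `runs` generations back to back. The first call deliberately
    absorbs any model startup/init cost, mirroring the posted methodology
    of measuring a realistic on/off local workflow rather than a
    warm-cache microbenchmark."""
    results = []
    for _ in range(runs):
        t0 = time.perf_counter()
        n_tokens = generate(prompt)  # assumed to return output-token count
        dt = time.perf_counter() - t0
        results.append((n_tokens, dt))
    avg_tps = mean(n / s for n, s in results)    # avg tokens/sec
    avg_wall = mean(s for _, s in results)       # avg wall time per run
    avg_tokens = mean(n for n, _ in results)     # avg output length
    return avg_tps, avg_wall, avg_tokens

# Hypothetical stand-in generator, purely for illustration.
def fake_generate(prompt):
    time.sleep(0.01)
    return 30

tps, wall, toks = benchmark(fake_generate, "hello")
```

Reporting all three averages matters here: as the post notes, a quant can win on wall time partly by emitting fewer tokens, so tokens/sec alone does not tell the whole story.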
Hot take: quant size alone does not determine end-to-end speed, especially on sparse MoE models where memory behavior and generation length can matter as much as raw throughput.
- The result is directionally plausible for a 35B MoE model: more memory usage can still buy slightly better throughput if it reduces bottlenecks elsewhere.
- The lower average output token count for Q4_K_XL likely contributes to the better wall-clock result, so this is not a pure apples-to-apples decoding-speed win.
- Including startup time makes the benchmark more representative of casual local use, but it also adds noise that can mask warm-run differences.
- The main takeaway is practical: on constrained local hardware, the “best” quant is not always the smallest one.
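The output-length point can be sanity-checked from the posted averages alone. Multiplying avg tokens/sec by avg wall time gives a rough implied output-token count per run (rough because wall time includes startup, during which no tokens are produced):

```python
# Implied output tokens per run, from the posted five-run averages.
# Rough estimates only: wall time includes startup/init, so the true
# generated-token counts are somewhat lower than these products.
xl_tokens = 29.78 * 99.93    # Q4_K_XL
m_tokens = 28.92 * 108.03    # Q4_K_M

print(round(xl_tokens), round(m_tokens))
```

Q4_K_XL comes out around 2,976 implied tokens versus roughly 3,124 for Q4_K_M, about 5% fewer, which supports the caveat that part of its wall-clock advantage comes from shorter generations rather than faster decoding.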
DISCOVERED: 2026-04-26
PUBLISHED: 2026-04-26
AUTHOR: EggDroppedSoup