OPEN_SOURCE
REDDIT // 3h ago // BENCHMARK RESULT
Smaller GGUF quants run slower on Qwen3.6
A Reddit user running Qwen3.6-35B-A3B in LM Studio and llama.cpp on a 10GB RTX 3080 plus a Ryzen 5 3600 reports the counterintuitive result that Q4_K_XL is much faster than IQ4_XS at identical settings, even though the IQ4_XS file is smaller. The post asks why a lower-bitrate GGUF quant would deliver roughly half the tokens per second, and whether the bottleneck is the quantization format, the GPU offload split, or MoE handling.
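One reason file size alone misleads here: neither 4-bit quant of a ~35B-parameter model fits in 10GB of VRAM, so both runs are already split across GPU and CPU. A back-of-envelope sketch in Python, with illustrative bits-per-weight assumptions (IQ4_XS is commonly cited near 4.25 bpw; ~4.9 is a placeholder for Q4_K_XL, whose per-layer sizes vary):

# Rough weight-file size: total params * bits-per-weight / 8.
# The bpw values are illustrative assumptions, not measured figures.
params = 35e9  # ~35B total parameters (A3B = ~3B active per token)
for name, bpw in [("IQ4_XS", 4.25), ("Q4_K_XL", 4.9)]:
    gib = params * bpw / 8 / 2**30  # bytes -> GiB
    print(f"{name}: ~{gib:.1f} GiB of weights vs 10 GiB of VRAM")

Both spill well past the 3080 either way, so the interesting variable is how each format behaves under partial offload, not which file is smaller.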
// ANALYSIS
Hot take: smaller file size is not the same thing as faster inference, especially when the quant format changes and the workload is a sparse MoE model.
- IQ4_XS is an i-quant format, which uses a more complex importance-matrix-based scheme than standard K-quants; that extra complexity can add dequantization overhead and land on less-optimized kernels in current llama.cpp builds.
- Q4_K_XL may simply have better backend support and more efficient matmul paths, so it can outperform a “smaller” quant on real hardware.
- With a 10GB 3080 and mixed CPU/GPU offload, throughput is often dominated by kernel efficiency, CPU-GPU transfer traffic, and KV cache pressure rather than raw model file size.
- For sparse MoE models, routing and expert placement make performance especially non-intuitive; fewer bytes on disk do not guarantee fewer stalls during token generation.
- The likely fix is to benchmark multiple quant families, not just smaller-vs-larger within one family, and to pin down the exact llama.cpp / LM Studio build, because quant-speed regressions are version-sensitive; a minimal sweep sketch follows this list.
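A minimal sweep sketch, assuming llama-cpp-python is installed; the GGUF paths and the n_gpu_layers value are placeholders to swap for local files and the 10GB card's actual headroom. Wall-clock time here includes prompt processing, which is coarse but adequate for an A/B at fixed settings:

import time
from llama_cpp import Llama

QUANTS = [  # hypothetical local paths; substitute your own downloads
    "Qwen3.6-35B-A3B-Q4_K_XL.gguf",
    "Qwen3.6-35B-A3B-IQ4_XS.gguf",
]
PROMPT = "Explain the difference between i-quants and K-quants in one paragraph."

for path in QUANTS:
    # identical settings for every quant: same offload split, same context
    llm = Llama(model_path=path, n_gpu_layers=20, n_ctx=4096, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256, temperature=0.0)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"{path}: {n_tokens / elapsed:.1f} tok/s")
    del llm  # release the model before loading the next quant

Running the same sweep across llama.cpp / LM Studio versions is what separates a format-level slowdown from a build-level regression.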
// TAGS
qwen · quantization · llama-cpp · lm-studio · moe · local-first · performance
DISCOVERED
3h ago
2026-05-06
PUBLISHED
5h ago
2026-05-05
RELEVANCE
6/10
AUTHOR
quickreactor