Smaller GGUF quants run slower on Qwen3.6
A Reddit user running Qwen3.6-35B-A3B in LM Studio and llama.cpp on a 3080 10GB plus Ryzen 5 3600 reports the counterintuitive result that Q4_K_XL is much faster than IQ_4_XS at the same settings, even though the IQ_4_XS file is smaller. The post asks why a lower-bitrate GGUF quant would deliver roughly half the tokens per second, and whether the bottleneck is the quantization format, GPU offload split, or MoE handling.
Hot take: smaller file size is not the same thing as faster inference, especially when the quant format changes and the workload is a sparse MoE model.
- IQ_4_XS is an i-quant. Where standard K-quants store linearly scaled block values, the 4-bit i-quants map weights through a non-linear lookup table; that can add dequantization overhead and land on less-optimized kernels in some llama.cpp builds, particularly for layers that run on the CPU. An importance matrix is a separate calibration input that either family can use, so it does not by itself explain a speed gap.
- Q4_K_XL may simply have better backend support and more efficient matmul paths, so it can outperform a “smaller” quant on real hardware.
- With a 10GB 3080 and mixed CPU/GPU offload, throughput can be dominated by kernel efficiency, CPU-GPU traffic, and KV cache pressure rather than raw model file size.
- For sparse MoE models, routing and expert placement can make performance especially non-intuitive; fewer bytes on disk does not guarantee fewer stalls during token generation.
- The likely fix is to benchmark across quant families, not just smaller-vs-larger within one family, and to record the exact llama.cpp / LM Studio build, because kernel performance for a given quant type can change between versions.
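The benchmarking advice above can be sketched as a small Python wrapper around llama.cpp's `llama-bench` tool. This is a minimal sketch, not the original poster's setup: the model file names are hypothetical, the `-ngl` value is a placeholder that should match the offload settings actually in use, and the JSON field names (`avg_ts`, `n_prompt`, `n_gen`) are assumed from recent `llama-bench` builds and should be checked against the installed version.

```python
import json
import subprocess
from pathlib import Path


def bench_command(model_path, n_gpu_layers=99, n_prompt=512, n_gen=128,
                  binary="llama-bench", extra_args=()):
    """Build an argv list that asks llama-bench for JSON output."""
    return [
        binary,
        "-m", str(model_path),
        "-ngl", str(n_gpu_layers),   # placeholder: match your real offload split
        "-p", str(n_prompt),
        "-n", str(n_gen),
        "-o", "json",
        *extra_args,
    ]


def generation_speed(rows):
    """Mean tokens/s over token-generation rows (assumed: n_prompt == 0, n_gen > 0)."""
    speeds = [r["avg_ts"] for r in rows
              if r.get("n_gen", 0) > 0 and r.get("n_prompt", 0) == 0]
    if not speeds:
        raise ValueError("no token-generation rows found")
    return sum(speeds) / len(speeds)


def rank_quants(results):
    """Sort {label: tokens/s} fastest first, adding speed relative to the fastest."""
    fastest = max(results.values())
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, tps, tps / fastest) for label, tps in ranked]


def run_bench(model_path, **kwargs):
    """Run llama-bench on one GGUF file and return its generation speed."""
    out = subprocess.run(bench_command(model_path, **kwargs),
                         capture_output=True, text=True, check=True)
    return generation_speed(json.loads(out.stdout))


if __name__ == "__main__":
    # Hypothetical file names; substitute the GGUF files being compared.
    models = {
        "Q4_K_XL": Path("model-Q4_K_XL.gguf"),
        "IQ_4_XS": Path("model-IQ_4_XS.gguf"),
    }
    results = {label: run_bench(path) for label, path in models.items()}
    for label, tps, rel in rank_quants(results):
        print(f"{label:10s} {tps:8.2f} tok/s  ({rel:.0%} of fastest)")
```

Running every quant through the same command line keeps offload, prompt length, and generation length fixed, so any remaining gap is attributable to the quant format and its kernels rather than to differing settings.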
Discovered: 2026-05-06
Published: 2026-05-05
Author: quickreactor
