OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT

Smaller GGUF quants run slower on Qwen3.6

A Reddit user running Qwen3.6-35B-A3B in LM Studio and llama.cpp on a 3080 10GB plus a Ryzen 5 3600 reports the counterintuitive result that Q4_K_XL is much faster than IQ4_XS at identical settings, even though the IQ4_XS file is smaller. The post asks why a lower-bitrate GGUF quant would deliver roughly half the tokens per second, and whether the bottleneck is the quantization format, the GPU offload split, or MoE handling.

// ANALYSIS

Hot take: smaller file size is not the same thing as faster inference, especially when the quant format changes and the workload is a sparse MoE model.

  • IQ4_XS is an i-quant format, which uses a more complex importance-matrix-based scheme than standard K-quants; that can add dequantization overhead and hit less-optimized kernels in current llama.cpp builds.
  • Q4_K_XL may simply have better backend support and more efficient matmul paths, so it can outperform a “smaller” quant on real hardware.
  • With a 10GB 3080 and mixed CPU/GPU offload, throughput can be dominated by kernel efficiency, CPU-GPU traffic, and KV cache pressure rather than raw model file size.
  • For sparse MoE models, routing and expert placement can make performance especially non-intuitive; reducing bytes on disk does not guarantee fewer stalls during token generation.
  • The likely fix is to benchmark multiple quant families, not just smaller-vs-larger within one family, and to verify the exact llama.cpp / LM Studio build because quant-speed regressions are version-sensitive.
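A rough way to see why bytes on disk don't predict decode speed: in the bandwidth-bound regime, tokens per second is approximately effective memory bandwidth divided by bytes touched per token (active parameters for a sparse MoE, not total), and a less-optimized dequant path lowers the effective bandwidth. A minimal back-of-the-envelope sketch; every number below (bits per weight, bandwidth, kernel efficiency) is an illustrative assumption, not a measurement of the poster's setup:

```python
# Roofline-style estimate: decode tok/s ~= effective_bandwidth / bytes_per_token.
# All figures are illustrative assumptions, not benchmarks.

def est_tokens_per_sec(active_params_b: float, bits_per_weight: float,
                       bandwidth_gbs: float, kernel_efficiency: float) -> float:
    """Bandwidth-bound decode estimate.

    active_params_b:    parameters read per token, in billions
                        (for an A3B MoE this is ~3B regardless of total size).
    bits_per_weight:    average bits of the quant format.
    bandwidth_gbs:      raw memory bandwidth in GB/s.
    kernel_efficiency:  fraction of peak bandwidth the dequant/matmul
                        kernels actually sustain (format- and build-dependent).
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 * kernel_efficiency / bytes_per_token

# Hypothetical comparison: a K-quant (~4.5 bpw) on a well-optimized kernel
# path vs a slightly smaller i-quant (~4.25 bpw) whose kernels sustain
# less of peak bandwidth on this backend.
q4k = est_tokens_per_sec(3.0, 4.50, 360, 0.70)
iq4 = est_tokens_per_sec(3.0, 4.25, 360, 0.35)

print(f"K-quant estimate: {q4k:.0f} tok/s")
print(f"i-quant estimate: {iq4:.0f} tok/s")
# The smaller format loses despite reading fewer bytes per token,
# because kernel efficiency dominates the ~5% byte savings.
```

With these assumed numbers the i-quant lands at roughly half the K-quant's throughput, the same shape of result the post describes; the point is that the bits-per-weight term is a small lever next to the kernel-efficiency term.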
// TAGS
qwen · quantization · llama-cpp · lm-studio · moe · local-first · performance

DISCOVERED

3h ago

2026-05-06

PUBLISHED

5h ago

2026-05-05

RELEVANCE

6 / 10

AUTHOR

quickreactor