REDDIT · 6d ago · BENCHMARK RESULT

Qwen3.5-4B quants favor Q5_K_M, Q6_K

This benchmark compares a wide range of Qwen3.5-4B GGUF quants on an Intel Lunar Lake laptop with 18GB of memory, measuring both token throughput and KLD against a BF16 reference. The results show a clear practical sweet spot around Q5_K_M and Q6_K: those quants keep KLD very low while still running in the low-20s tok/s, whereas Q8_0 sets the quality ceiling but gives up noticeable speed. The post also suggests that uploader and quantization method matter, since the same nominal quant can land at meaningfully different quality scores across builds.

// ANALYSIS

Hot take: on this machine, “best” is not the smallest quant or the fastest quant, it’s the one that stays under roughly Q6 without wasting RAM on near-lossless accuracy you probably won’t feel in chat.

  • Q5_K_M is the most balanced pick in this dataset: strong quality, still fast enough to feel responsive, and notably better KLD than most Q4 variants.
  • Q6_K looks like the quality-first sweet spot if you can tolerate dropping into the ~20 tok/s range.
  • Q8_0 is effectively the accuracy ceiling here, but the speed penalty makes it hard to justify unless you care about fidelity more than latency.
  • The spread between uploaders is real: for the same quant label, KLD can vary enough to change the recommendation.
  • The data is useful for this laptop class, but I would be cautious about extrapolating directly to larger models or different memory-bandwidth-limited systems.
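The KLD scores behind these rankings measure how far each quant's next-token distribution drifts from the BF16 reference, averaged over a corpus. A minimal sketch of that per-token computation (the function names and the NumPy implementation are illustrative, not the post's actual tooling):

```python
import numpy as np

def token_kld(ref_logits, quant_logits):
    """Per-token KL(ref || quant) from raw logits.

    ref_logits: logits from the BF16 reference model at one position.
    quant_logits: logits from the quantized model at the same position.
    Returns a non-negative float; 0.0 means identical distributions.
    """
    # Numerically stable log-softmax for both models.
    ref_lp = ref_logits - np.logaddexp.reduce(ref_logits)
    quant_lp = quant_logits - np.logaddexp.reduce(quant_logits)
    ref_p = np.exp(ref_lp)
    # KL(ref || quant) = sum_v p_ref(v) * (log p_ref(v) - log p_quant(v))
    return float(np.sum(ref_p * (ref_lp - quant_lp)))

def mean_kld(ref_logit_rows, quant_logit_rows):
    """Mean KLD over a corpus: the single per-quant score a benchmark reports."""
    return float(np.mean([token_kld(r, q)
                          for r, q in zip(ref_logit_rows, quant_logit_rows)]))
```

Because log-softmax is invariant to a constant shift of the logits, a quant that merely rescales activations uniformly scores zero here; KLD only penalizes changes in the *shape* of the predicted distribution, which is why it tracks perceived chat quality better than raw weight error.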
// TAGS
qwen · gguf · quantization · llama.cpp · benchmark · lunar-lake · intel · kld

DISCOVERED

6d ago

2026-04-06

PUBLISHED

6d ago

2026-04-06

RELEVANCE

8 / 10

AUTHOR

Tryshea