Qwen3.6-27B benchmark favors weight quants
This Reddit post benchmarks llama.cpp quantization combinations for Qwen3.6-27B with an approximate KL-divergence proxy on Wikitext-2 at 16k context. The author concludes that weight quantization matters more than KV-cache quantization, so quantizing the cache can be worth it if it lets you move up a weight-quant tier, with q5_* looking safer than q4_0.
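To make the memory side of that trade concrete, here is a rough sketch of how KV-cache size scales with cache type. The bits-per-element figures follow llama.cpp's block formats (32 values plus a per-block scale). The model dimensions are hypothetical placeholders, not Qwen3.6-27B's real architecture, and the formula assumes a plain transformer that keeps a full K and V cache on every layer.

```python
# Storage cost per cached element, in bits, for llama.cpp cache types.
# Block formats hold 32 values plus a 16-bit scale (q5_0 adds 32 high bits).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   k_type="f16", v_type="f16"):
    """Estimate KV-cache size in bytes for one sequence.

    Assumes every layer stores a full K and V tensor of shape
    (n_ctx, n_kv_heads, head_dim). Hybrid or sliding-window
    architectures will need less than this.
    """
    elements = n_layers * n_kv_heads * head_dim * n_ctx
    k_bytes = elements * BITS_PER_ELEMENT[k_type] / 8
    v_bytes = elements * BITS_PER_ELEMENT[v_type] / 8
    return k_bytes + v_bytes


# Hypothetical dimensions, chosen only for illustration.
dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=16384)

f16_cache = kv_cache_bytes(**dims)
q8_cache = kv_cache_bytes(**dims, k_type="q8_0", v_type="q8_0")

GIB = 2 ** 30
print(f"f16 cache:  {f16_cache / GIB:.2f} GiB")
print(f"q8_0 cache: {q8_cache / GIB:.2f} GiB")
print(f"freed:      {(f16_cache - q8_cache) / GIB:.2f} GiB")
```

Whether the freed memory is enough to step up a weight-quant tier depends on the actual model files and context length, so the real numbers should be plugged in before drawing conclusions.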
Hot take: this is a useful directional benchmark, and the direction is pretty clear even if the metric is approximate.
- Q5 weight quants beat Q4 weight quants across the board, even when the Q4 setup keeps the KV cache in f16.
- Quantizing the KV cache hurts less than dropping a model tier, so KV quantization is a reasonable trade if it unlocks a better weight quant.
- Within the same tier, mixed KV settings still matter, but the delta is smaller than the gap between Q5 and Q4.
- The strongest caveat is methodological: the KLD is approximated against Q5_K_M, not the full 16-bit model, so treat the numbers as comparative rather than absolute.
- The test setup is narrow: Wikitext-2, 16k context, and one model family, so the conclusion should not be generalized too aggressively.
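For readers unfamiliar with the metric, the sketch below shows how a mean per-token KL divergence between a reference model and a quantized model can be computed from their logits. It is a generic illustration in plain Python, not the post author's script, and the function names and toy logits are made up for the example. The reference distribution comes from whatever baseline is chosen, which is why a Q5_K_M baseline yields comparative rather than absolute numbers.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P || Q) in nats, where P is the reference and Q the test model."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_rows, test_rows):
    """Average the per-token KL divergence over all evaluated positions."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)


# Toy logits for three token positions over a four-word vocabulary.
reference = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 0.5, 0.5], [3.0, 0.0, 0.0, 0.0]]
quantized = [[1.9, 1.1, 0.0, -0.8], [0.6, 0.4, 0.5, 0.5], [2.5, 0.2, 0.1, 0.0]]

print(f"mean KLD: {mean_kld(reference, quantized):.6f} nats")
```

A value of zero means the two models assign identical next-token distributions at every position; larger values mean the quantized model has drifted further from the reference.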
DISCOVERED
2026-05-24
PUBLISHED
2026-05-24
AUTHOR
hopbel