Gemma 4, Qwen 3.6 Diverge on KV Cache
OPEN_SOURCE
REDDIT // BENCHMARK RESULT · 5h ago


Localbench benchmarked Gemma 4 and Qwen 3.6 with f16, q8_0, and q4_0 KV cache settings to measure KL divergence against a full-precision baseline. The results show Gemma 4 degrades sharply under cache quantization, while Qwen 3.6 stays much closer to lossless behavior.
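The metric is per-token KL divergence between the full-precision run and the quantized-cache run on the same prompt. A minimal sketch of that computation, assuming you can capture per-token logits from both runs (the function names here are illustrative, not Localbench's actual code):

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kl(baseline_logits, quantized_logits):
    """Mean per-token KL(baseline || quantized) over a sequence.

    Both arguments are (seq_len, vocab) arrays of raw logits from the same
    prompt: one run with an f16 KV cache, one with a quantized cache.
    """
    p = softmax(baseline_logits)
    q = softmax(quantized_logits)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)  # one KL value per token
    return float(kl.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
# Identical logits give zero divergence; a distorted run gives a positive score.
assert abs(token_kl(base, base)) < 1e-9
assert token_kl(base, base * 2.0) > 0
```

A KL near zero means the quantized cache is effectively lossless for that prompt; larger values mean the model is sampling from a visibly different distribution.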

// ANALYSIS

The useful takeaway is blunt: q8_0 is not a safe default across all models, and Gemma 4 appears far more fragile than Qwen 3.6 when you squeeze the KV cache. That makes cache quantization a model-specific tuning problem, not a one-size-fits-all optimization.

  • Gemma 4 26B A4B is the standout outlier: its q8_0 cache damage is much worse than the dense Gemma 31B's, and q4_0 leaves it heavily degraded
  • Qwen 3.6 remains comparatively stable, with both dense and MoE variants staying low-KL at q8_0 and still usable at q4_0
  • The benchmark is practical, not theoretical: it measures token-level KL divergence across coding, chat, tool use, science, non-Latin scripts, and long documents
  • Cache quantization stacks with weight quantization, so a Q4 model plus q8_0 cache compounds quality loss rather than replacing it
  • The biggest pain shows up in long-context and tool-use scenarios, which are exactly where local inference users care about memory savings most
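The memory pressure driving that last point is simple arithmetic: the KV cache stores keys and values for every layer and every token. A sketch using ggml's block sizes (q8_0 is about 8.5 bits per element, q4_0 about 4.5) and hypothetical model dimensions, since the real Gemma 4 config isn't given here:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Approximate KV cache size: keys + values, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical dimensions for illustration only (not Gemma 4's actual config):
layers, kv_heads, hdim, ctx = 48, 8, 128, 131072

f16 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 2.0)
q8 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 1.0625)  # ggml q8_0: 34 bytes / 32 elems
q4 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 0.5625)  # ggml q4_0: 18 bytes / 32 elems

print(f"f16: {f16/2**30:.1f} GiB, q8_0: {q8/2**30:.1f} GiB, q4_0: {q4/2**30:.1f} GiB")
```

At full context the cache roughly halves going f16 → q8_0 and halves again at q4_0, which is exactly why long-context users reach for it, and why a model that degrades under it hurts most there.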
// TAGS
localbench · benchmark · llm · inference · gemma-4 · qwen-3.6

DISCOVERED: 5h ago (2026-04-24)

PUBLISHED: 7h ago (2026-04-24)

RELEVANCE: 9/10

AUTHOR: oobabooga4