Gemma 4, Qwen 3.6 Diverge on KV Cache

// 45d ago · BENCHMARK RESULT

Localbench benchmarked Gemma 4 and Qwen 3.6 with f16, q8_0, and q4_0 KV cache settings to measure KL divergence against a full-precision baseline. The results show Gemma 4 degrades sharply under cache quantization, while Qwen 3.6 stays much closer to lossless behavior.
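
As a rough illustration of the metric, the sketch below computes mean token-level KL divergence between a baseline run and a quantized-cache run. It is not Localbench's code: the function names are hypothetical, and it assumes both runs expose raw logits for the same token positions, with the baseline distribution as the reference.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(base_logits, test_logits):
    """KL(P_base || P_test) in nats for a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_seq, test_seq):
    """Average per-token KL over two aligned sequences of logit vectors."""
    if not base_seq or len(base_seq) != len(test_seq):
        raise ValueError("sequences must be non-empty and the same length")
    total = sum(token_kl(b, t) for b, t in zip(base_seq, test_seq))
    return total / len(base_seq)

# Toy example (made-up logits, not benchmark data): two positions, vocab of 3.
baseline = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
quantized = [[1.8, 0.7, -1.1], [0.1, 0.2, 3.0]]
print(f"mean KL: {mean_kl(baseline, quantized):.6f} nats")
```

A value of zero means the quantized-cache run reproduces the baseline distribution exactly; larger values mean the next-token probabilities have drifted further from it.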

// ANALYSIS

The useful takeaway is blunt: q8_0 is not a safe default across all models, and Gemma 4 appears far more fragile than Qwen 3.6 when you squeeze the KV cache. That makes cache quantization a model-specific tuning problem, not a one-size-fits-all optimization.

  • Gemma 4 26B A4B is the standout outlier: its q8_0 cache damage is much worse than that of the dense Gemma 4 31B, and it is heavily degraded at q4_0
  • Qwen 3.6 remains comparatively stable, with both dense and MoE variants staying low-KL at q8_0 and still usable at q4_0
  • The benchmark is practical, not theoretical: it measures token-level KL divergence across coding, chat, tool use, science, non-Latin scripts, and long documents
  • Cache quantization stacks with weight quantization, so a Q4 model with a q8_0 cache takes both quality hits; the cache loss adds to the weight loss rather than replacing it
  • The biggest pain shows up in long-context and tool-use scenarios, which are exactly where local inference users care about memory savings most
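
To put the memory side of that trade-off in numbers, here is a back-of-the-envelope KV cache size estimate. It is a sketch under stated assumptions: the cache type names are taken to be ggml-style block formats (q8_0 at 8.5 bits per element, q4_0 at 4.5), every layer is assumed to cache full-context keys and values, and the model dimensions are hypothetical rather than those of Gemma 4 or Qwen 3.6. Architectures that mix attention types will deviate from this formula.

```python
# Bits stored per cached element, assuming ggml-style block formats:
# q8_0 = 32 int8 values + one fp16 scale (34 bytes per 32 elements),
# q4_0 = 32 4-bit values + one fp16 scale (18 bytes per 32 elements).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate KV cache size: K and V tensors for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return int(elements * BITS_PER_ELEMENT[cache_type] / 8)

# Hypothetical model shape, chosen only for illustration.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for cache_type in BITS_PER_ELEMENT:
    size = kv_cache_bytes(n_ctx=32768, cache_type=cache_type, **HYPOTHETICAL_MODEL)
    print(f"{cache_type:>5}: {size / 2**30:.3f} GiB")
```

For this made-up shape at a 32,768-token context, the estimate is 4 GiB at f16, 2.125 GiB at q8_0, and 1.125 GiB at q4_0. Savings on that scale are what make cache quantization tempting, and why the per-model quality cost is worth measuring before enabling it.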
// TAGS
localbench, benchmark, llm, inference, gemma-4, qwen-3.6

DISCOVERED

45d ago

2026-04-24

PUBLISHED

45d ago

2026-04-24

RELEVANCE

9/10

AUTHOR

oobabooga4