Gemma 4, Qwen 3.6 Diverge on KV Cache
OPEN_SOURCE
REDDIT // BENCHMARK RESULT · 5h ago


Localbench benchmarked Gemma 4 and Qwen 3.6 with f16, q8_0, and q4_0 KV cache settings to measure KL divergence against a full-precision baseline. The results show Gemma 4 degrades sharply under cache quantization, while Qwen 3.6 stays much closer to lossless behavior.
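The metric is per-token KL divergence between the full-precision run and the quantized-cache run on the same prompt. A minimal sketch of that computation, assuming you can capture per-token logits from both runs (the function names here are illustrative, not Localbench's actual code):

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kl(baseline_logits, quantized_logits):
    """Mean per-token KL(baseline || quantized) over a sequence.

    Both arguments are (seq_len, vocab) arrays of raw logits from the same
    prompt: one run with an f16 KV cache, one with a quantized cache.
    """
    p = softmax(baseline_logits)
    q = softmax(quantized_logits)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)  # one KL value per token
    return float(kl.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
# Identical logits give zero divergence; a distorted run gives a positive score.
assert abs(token_kl(base, base)) < 1e-9
assert token_kl(base, base * 2.0) > 0
```

A KL near zero means the quantized cache is effectively lossless for that prompt; larger values mean the model is sampling from a visibly different distribution.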

// ANALYSIS

The useful takeaway is blunt: q8_0 is not a safe default across all models, and Gemma 4 appears far more fragile than Qwen 3.6 when you squeeze the KV cache. That makes cache quantization a model-specific tuning problem, not a one-size-fits-all optimization.

  • Gemma 4 26B A4B is the standout outlier: its q8_0 cache damage is much worse than the dense Gemma 31B's, and q4_0 leaves it heavily degraded
  • Qwen 3.6 remains comparatively stable, with both dense and MoE variants staying low-KL at q8_0 and still usable at q4_0
  • The benchmark is practical, not theoretical: it measures token-level KL divergence across coding, chat, tool use, science, non-Latin scripts, and long documents
  • Cache quantization stacks with weight quantization, so a Q4 model plus q8_0 cache compounds quality loss rather than replacing it
  • The biggest pain shows up in long-context and tool-use scenarios, which are exactly where local inference users care about memory savings most
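The memory pressure driving that last point is simple arithmetic: the KV cache stores keys and values for every layer and every token. A sketch using ggml's block sizes (q8_0 is about 8.5 bits per element, q4_0 about 4.5) and hypothetical model dimensions, since the real Gemma 4 config isn't given here:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Approximate KV cache size: keys + values, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical dimensions for illustration only (not Gemma 4's actual config):
layers, kv_heads, hdim, ctx = 48, 8, 128, 131072

f16 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 2.0)
q8 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 1.0625)  # ggml q8_0: 34 bytes / 32 elems
q4 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 0.5625)  # ggml q4_0: 18 bytes / 32 elems

print(f"f16: {f16/2**30:.1f} GiB, q8_0: {q8/2**30:.1f} GiB, q4_0: {q4/2**30:.1f} GiB")
```

At full context the cache roughly halves going f16 → q8_0 and halves again at q4_0, which is exactly why long-context users reach for it, and why a model that degrades under it hurts most there.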
// TAGS
localbench · benchmark · llm · inference · gemma-4 · qwen-3.6

DISCOVERED: 5h ago (2026-04-24)

PUBLISHED: 7h ago (2026-04-24)

RELEVANCE: 9/10

AUTHOR: oobabooga4