llama.cpp KV cache quantization shifts KLD across models
OPEN_SOURCE
REDDIT // 19d ago · BENCHMARK RESULT


Velocita84 benchmarked eight llama.cpp KV cache quantization modes across Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, and Irix 12B using wikitext-2 plus a 32k-token conversation. The numbers are noisy because the reference logits came from an IQ4_XS base model, but they still show that KV compression sensitivity varies widely by model.
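For orientation, a KLD benchmark of this shape is a two-step affair in llama.cpp: save baseline logits once, then re-score with the KV cache quantized. A sketch of the workflow (model and file names are placeholders; `--kl-divergence`, `--kl-divergence-base`, and `--cache-type-k/v` are real `llama-perplexity` options):

```shell
# Step 1: run the baseline (default f16 KV cache) and save its logits.
./llama-perplexity -m model-IQ4_XS.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits.bin

# Step 2: repeat with a quantized KV cache and report KLD vs the saved logits.
# (A quantized V cache generally also requires flash attention to be enabled.)
./llama-perplexity -m model-IQ4_XS.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits.bin --kl-divergence \
    --cache-type-k q4_0 --cache-type-v q4_0
```

Note that because the baseline logits in step 1 come from the same quantized model, step 2 can only measure drift introduced by the cache change, which is exactly the caveat raised above.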

// ANALYSIS

This is a useful directional benchmark, not a clean verdict. The real story is that KV cache quantization sensitivity is model-family-specific: there is no single cache-compression setting that is safe across architectures.

  • `wikitext-2` with the default 512-token window is a blunt proxy; the longer-context run is more relevant to real local inference stress.
  • `llama-perplexity` only scores the latter half of each context window, so assistant and tool-call behavior are still underrepresented.
  • Because the baseline logits were generated from `IQ4_XS`, the results are best read as relative drift from KV changes, not absolute bf16 quality loss.
  • The Bartowski vs Unsloth note suggests upstream model quantization can confound cache-only comparisons.
  • `Qwen3 VL` looks like the warning label, suggesting multimodal models may be less forgiving of aggressive KV compression.
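The KLD figures behind points like these are per-token KL divergences between the baseline and KV-quantized next-token distributions, averaged over the corpus. A minimal Python illustration of the statistic itself (toy logits, not llama.cpp's implementation):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) for two next-token distributions given their raw logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits -> zero divergence; any drift -> strictly positive KLD.
same = kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
drifted = kl_divergence([1.0, 2.0, 3.0], [1.2, 1.9, 2.8])
print(f"identical: {same:.6f}, drifted: {drifted:.6f}")
```

Because KL(P || Q) is zero only when the distributions match, averaging it over every scored token gives a drift measure that is more sensitive than perplexity alone, which is why it is the preferred metric for cache-quantization comparisons like this one.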

// TAGS
llama-cpp · llm · inference · benchmark · multimodal · gpu

DISCOVERED

2026-03-23

PUBLISHED

2026-03-23

RELEVANCE

8/10

AUTHOR

Velocita84