OPEN_SOURCE
REDDIT // BENCHMARK RESULT
llama.cpp KV cache quantization shifts KLD across models
Velocita84 benchmarked eight llama.cpp KV cache quantization modes across Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, and Irix 12B using wikitext-2 plus a 32k-token conversation. The numbers are noisy because the reference logits came from an IQ4_XS base model, but they still show that KV compression sensitivity varies widely by model.
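The card does not show the original commands, but a sweep like this can be sketched with llama.cpp's `llama-perplexity` tool, which supports saving reference logits and scoring KL divergence against them. The model path, text file, and the exact list of cache modes below are placeholders, not the poster's actual setup.

```shell
# Hypothetical reproduction sketch of a KV-cache-quantization KLD sweep
# with llama.cpp's llama-perplexity. Paths and mode list are assumptions.

MODEL=model-IQ4_XS.gguf   # quantized base model; reference logits come from it too
TEXT=wiki.test.raw        # wikitext-2 test split, or a long-context transcript

# 1) Save reference logits once, with the KV cache left at the f16 default.
./llama-perplexity -m "$MODEL" -f "$TEXT" \
    --kl-divergence-base base-logits.bin

# 2) Re-run under each KV cache quantization mode and compare against the
#    saved logits. -ctk/-ctv set the K and V cache types; a quantized V
#    cache needs flash attention (-fa; flag syntax varies across versions).
for kv in q8_0 q5_1 q5_0 q4_1 q4_0; do
    ./llama-perplexity -m "$MODEL" -f "$TEXT" -fa \
        -ctk "$kv" -ctv "$kv" \
        --kl-divergence-base base-logits.bin --kl-divergence
done
```

Because step 1 uses the IQ4_XS model rather than bf16 weights, every KLD number from step 2 measures drift introduced by the cache settings alone, which matches how the post's results should be read.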
// ANALYSIS
This is a useful directional benchmark, not a clean verdict. The real story is that sensitivity to KV cache quantization is model-family-specific, so no single cache setting is safe across the board.
- `wikitext-2` with the default 512-token window is a blunt proxy; the longer-context run is more relevant to real local inference stress.
- `llama-perplexity` only scores the latter half of each context window, so assistant and tool-call behavior are still underrepresented.
- Because the baseline logits were generated from `IQ4_XS`, the results are best read as relative drift from KV changes, not absolute bf16 quality loss.
- The Bartowski vs Unsloth note suggests upstream model quantization can confound cache-only comparisons.
- `Qwen3 VL` looks like the warning label here: multimodal models may be less forgiving of aggressive KV compression.
// TAGS
llama-cpp · llm · inference · benchmark · multimodal · gpu
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8/10
AUTHOR
Velocita84