llama.cpp KV cache quantization shifts KLD across models

// 65d ago | BENCHMARK RESULT

Velocita84 benchmarked eight llama.cpp KV cache quantization modes across Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, and Irix 12B using wikitext-2 plus a 32k-token conversation. The numbers are noisy because the reference logits came from an IQ4_XS base model, but they still show that KV compression sensitivity varies widely by model.
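
For readers unfamiliar with the metric, here is a minimal sketch of how a mean per-token KL divergence (KLD) between reference logits and logits from a run with a quantized KV cache could be computed. The function names, array shapes, and toy data are assumptions for illustration only; this is not the benchmark's actual code or llama.cpp's implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kld(ref_logits, test_logits):
    """Mean per-token KL(ref || test) in nats for arrays of shape (tokens, vocab)."""
    ref_logp = log_softmax(np.asarray(ref_logits, dtype=np.float64))
    test_logp = log_softmax(np.asarray(test_logits, dtype=np.float64))
    # Sum p_ref * (log p_ref - log p_test) over the vocabulary, per token.
    per_token = (np.exp(ref_logp) * (ref_logp - test_logp)).sum(axis=-1)
    return float(per_token.mean())

# Toy data (hypothetical): 4 token positions, vocabulary of 8.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 8))
noisy = ref + rng.normal(scale=0.1, size=ref.shape)

print(mean_kld(ref, ref))    # identical logits: zero divergence
print(mean_kld(ref, noisy))  # perturbed logits: small positive divergence
```

Because the reference here is itself an `IQ4_XS` model, a value computed this way measures drift relative to that quantized baseline, not relative to full-precision weights.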

// ANALYSIS

This is a useful directional benchmark, not a clean verdict. The real story is that KV cache quantization is model-family-specific, and `Qwen3 VL` looks like the warning label.

  • `wikitext-2` with the default 512-token window is a blunt proxy; the longer-context run is more relevant to real local inference stress.
  • `llama-perplexity` only scores the latter half of each context window, so assistant and tool-call behavior are still underrepresented.
  • Because the baseline logits were generated from `IQ4_XS`, the results are best read as relative drift from KV changes, not absolute bf16 quality loss.
  • The Bartowski vs Unsloth note suggests upstream model quantization can confound cache-only comparisons.
  • `Qwen3 VL` is the standout case, which hints that multimodal models may be less forgiving of aggressive KV compression.
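
To make the "latter half of each context window" point concrete, the sketch below lists which token positions would be scored under a simplified model: the text is cut into non-overlapping windows and only the second half of each window counts. The helper `scored_positions` is hypothetical, and exact boundary handling in `llama-perplexity` may differ by a token or so.

```python
def scored_positions(n_tokens, n_ctx=512):
    """Indices of tokens scored when only the second half of each
    non-overlapping n_ctx-token window contributes to the metric."""
    positions = []
    # Only full windows are considered; a trailing partial window is dropped.
    for start in range(0, n_tokens - n_ctx + 1, n_ctx):
        positions.extend(range(start + n_ctx // 2, start + n_ctx))
    return positions

short_run = scored_positions(n_tokens=2048, n_ctx=512)
long_run = scored_positions(n_tokens=32768, n_ctx=32768)

print(len(short_run), short_run[0])  # tokens scored, first scored index
print(len(long_run), long_run[0])
```

Under this model, a single 32k-token window scores only tokens that already have at least 16k tokens of cached context behind them, which is why the long-context run stresses the KV cache far more than the 512-token default.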
// TAGS
llama-cpp, llm, inference, benchmark, multimodal, gpu

DISCOVERED

65d ago

2026-03-23

PUBLISHED

65d ago

2026-03-23

RELEVANCE

8/10

AUTHOR

Velocita84