OPEN_SOURCE
REDDIT // BENCHMARK RESULT
llama.cpp KV cache quantization shifts KLD across models
Velocita84 benchmarked eight llama.cpp KV cache quantization modes across Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, and Irix 12B using wikitext-2 plus a 32k-token conversation. The numbers are noisy because the reference logits came from an IQ4_XS base model, but they still show that KV compression sensitivity varies widely by model.
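The card does not show the original commands, but a sweep like this can be sketched with llama.cpp's `llama-perplexity` tool, which supports saving reference logits and scoring KL divergence against them. The model path, text file, and the exact list of cache modes below are placeholders, not the poster's actual setup.

```shell
# Hypothetical reproduction sketch of a KV-cache-quantization KLD sweep
# with llama.cpp's llama-perplexity. Paths and mode list are assumptions.

MODEL=model-IQ4_XS.gguf   # quantized base model; reference logits come from it too
TEXT=wiki.test.raw        # wikitext-2 test split, or a long-context transcript

# 1) Save reference logits once, with the KV cache left at the f16 default.
./llama-perplexity -m "$MODEL" -f "$TEXT" \
    --kl-divergence-base base-logits.bin

# 2) Re-run under each KV cache quantization mode and compare against the
#    saved logits. -ctk/-ctv set the K and V cache types; a quantized V
#    cache needs flash attention (-fa; flag syntax varies across versions).
for kv in q8_0 q5_1 q5_0 q4_1 q4_0; do
    ./llama-perplexity -m "$MODEL" -f "$TEXT" -fa \
        -ctk "$kv" -ctv "$kv" \
        --kl-divergence-base base-logits.bin --kl-divergence
done
```

Because step 1 uses the IQ4_XS model rather than bf16 weights, every KLD number from step 2 measures drift introduced by the cache settings alone, which matches how the post's results should be read.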
// ANALYSIS
This is a useful directional benchmark, not a clean verdict. The real story is that sensitivity to KV cache quantization is model-family-specific, so no single cache setting is safe across the board.
- `wikitext-2` with the default 512-token window is a blunt proxy; the longer-context run is more relevant to real local inference stress.
- `llama-perplexity` only scores the latter half of each context window, so assistant and tool-call behavior are still underrepresented.
- Because the baseline logits were generated from `IQ4_XS`, the results are best read as relative drift from KV changes, not absolute bf16 quality loss.
- The Bartowski vs Unsloth note suggests upstream model quantization can confound cache-only comparisons.
- `Qwen3 VL` looks like the warning label here: multimodal models may be less forgiving of aggressive KV compression.
// TAGS
llama-cpp · llm · inference · benchmark · multimodal · gpu
DISCOVERED
2026-03-23
PUBLISHED
2026-03-23
RELEVANCE
8/10
AUTHOR
Velocita84