OPEN_SOURCE
REDDIT // BENCHMARK RESULT
SAW rotation wins KV cache sweep
A WikiText-2 perplexity sweep across Llama, Qwen, and Gemma models finds that SAW-style KV rotation usually gives the best quality-memory tradeoff. The standout result: SAW tends to beat plain asymmetric quantization at 4-bit, while some competing methods crash outright or produce broken output on specific model architectures.
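For concreteness, here's a minimal sketch of the "plain asymmetric quantization" baseline that presets like `q4k` approximate: each group of values gets a float scale and zero point, and entries snap to one of 16 int4 levels. The group size and layout here are illustrative assumptions, not the benchmark's exact presets.

```python
import numpy as np

def asym_quant_int4(x: np.ndarray, group_size: int = 64):
    """Plain asymmetric 4-bit quantization over contiguous groups."""
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0   # 4 bits -> levels 0..15
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def asym_dequant_int4(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

# Round-trip a fake KV tile and measure the reconstruction error.
rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128)).astype(np.float32)
q, s, z = asym_quant_int4(kv)
err = np.abs(asym_dequant_int4(q, s, z, kv.shape) - kv).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```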
// ANALYSIS
The takeaway is pretty blunt: if you want KV compression that actually survives contact with real models, SAW looks like the safest default, not TurboQuant-style 3-bit heroics.
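The post doesn't spell out SAW's exact transform, but rotation-based KV quantization in the QuaRot/SpinQuant family multiplies K (and sometimes V) by an orthogonal matrix, often a Hadamard transform, before quantizing; outlier channels get smeared evenly across the head dimension, so the int4 grid covers the bulk of the distribution instead of one extreme channel. A rough, self-contained sketch assuming a power-of-two head dimension:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix for power-of-two n (an orthogonal rotation)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Asymmetric int4 quantize + dequantize, one scale/zero point per row."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    return np.clip(np.round((x - lo) / scale), 0, 15) * scale + lo

head_dim = 128
H = hadamard(head_dim)

rng = np.random.default_rng(0)
k = rng.standard_normal((32, head_dim)).astype(np.float32)
k[:, 7] *= 20.0                       # plant one outlier channel

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# H is orthogonal, so rotating costs no information: quantize in the
# rotated basis, rotate back, and compare against the original keys.
plain   = rms(int4_roundtrip(k), k)
rotated = rms(int4_roundtrip(k @ H) @ H.T, k)
print(f"plain int4 RMS error:   {plain:.4f}")
print(f"rotated int4 RMS error: {rotated:.4f}")
```

On this toy tensor the rotated path lands at a fraction of the plain error; the mechanism, not the exact numbers, is the point.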
- On Llama 3.1 8B, `saw4k` roughly halves the quality hit versus `q4k` at similar KV savings, which is the kind of delta that matters in production
- `saw8k` and `saw8kv` are close to lossless across several models, so 8-bit SAW looks like the low-risk choice when memory pressure is moderate
- Qwen2.5 7B exposes a hard failure mode for 4-bit KV quantization: its K projection bias makes int4 quantization noise swamp the signal (a toy reproduction follows this list)
- Gemma 4 shows that architecture matters more than slogans: some presets OOM or produce garbage PPL, so these methods are not universally portable
- The benchmark makes a strong case for treating KV quantization as model-specific systems work, not a one-size-fits-all compression trick
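The Qwen2.5 bullet is easy to reproduce in miniature. Qwen's attention applies a bias on the K projection, and a per-channel offset widens every row's dynamic range in a way a per-row asymmetric zero point can't absorb, so the 4-bit grid spends its 16 levels spanning the offsets rather than the signal. A toy illustration; the bias magnitude is invented, not Qwen's actual weights:

```python
import numpy as np

def int4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Asymmetric int4 quantize + dequantize, one scale/zero point per row."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    return np.clip(np.round((x - lo) / scale), 0, 15) * scale + lo

def quant_noise(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean((int4_roundtrip(x) - x) ** 2)))

rng = np.random.default_rng(0)
signal = rng.standard_normal((32, 128)).astype(np.float32)
bias = (rng.standard_normal(128) * 30.0).astype(np.float32)  # hypothetical K bias

# A per-row zero point absorbs a uniform shift, but a *per-channel* bias
# stretches each row's min..max range, inflating the quantization step.
print(f"signal RMS:            {np.sqrt(np.mean(signal**2)):.3f}")
print(f"int4 noise, no bias:   {quant_noise(signal):.3f}")
print(f"int4 noise, with bias: {quant_noise(signal + bias):.3f}")
```

With the bias in place, the quantization noise lands well above the signal RMS, which matches the benchmark's "noise swamps the signal" description; handling the bias separately (or rotating first) would be the obvious mitigation to try.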
// TAGS
kv-cache · quantization · benchmark · inference · long-context · llm · open-source
DISCOVERED
2026-05-02
PUBLISHED
2026-05-01
RELEVANCE
8 / 10
AUTHOR
fredandlunchbox