OPEN_SOURCE
REDDIT // BENCHMARK RESULT
SAW rotation wins KV cache sweep
A WikiText-2 perplexity sweep across Llama, Qwen, and Gemma models finds that SAW-style KV rotation usually gives the best quality-memory tradeoff. The standout result: SAW tends to beat plain asymmetric quantization at 4-bit, while some competing methods crash outright or produce broken output on specific model architectures.
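For concreteness, here's a minimal sketch of the "plain asymmetric quantization" baseline that presets like `q4k` approximate: each group of values gets a float scale and zero point, and entries snap to one of 16 int4 levels. The group size and layout here are illustrative assumptions, not the benchmark's exact presets.

```python
import numpy as np

def asym_quant_int4(x: np.ndarray, group_size: int = 64):
    """Plain asymmetric 4-bit quantization over contiguous groups."""
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0   # 4 bits -> levels 0..15
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def asym_dequant_int4(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

# Round-trip a fake KV tile and measure the reconstruction error.
rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128)).astype(np.float32)
q, s, z = asym_quant_int4(kv)
err = np.abs(asym_dequant_int4(q, s, z, kv.shape) - kv).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```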
// ANALYSIS
The takeaway is pretty blunt: if you want KV compression that actually survives contact with real models, SAW looks like the safest default, not TurboQuant-style 3-bit heroics.
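The post doesn't spell out SAW's exact transform, but rotation-based KV quantization in the QuaRot/SpinQuant family multiplies K (and sometimes V) by an orthogonal matrix, often a Hadamard transform, before quantizing; outlier channels get smeared evenly across the head dimension, so the int4 grid covers the bulk of the distribution instead of one extreme channel. A rough, self-contained sketch assuming a power-of-two head dimension:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix for power-of-two n (an orthogonal rotation)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Asymmetric int4 quantize + dequantize, one scale/zero point per row."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    return np.clip(np.round((x - lo) / scale), 0, 15) * scale + lo

head_dim = 128
H = hadamard(head_dim)

rng = np.random.default_rng(0)
k = rng.standard_normal((32, head_dim)).astype(np.float32)
k[:, 7] *= 20.0                       # plant one outlier channel

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# H is orthogonal, so rotating costs no information: quantize in the
# rotated basis, rotate back, and compare against the original keys.
plain   = rms(int4_roundtrip(k), k)
rotated = rms(int4_roundtrip(k @ H) @ H.T, k)
print(f"plain int4 RMS error:   {plain:.4f}")
print(f"rotated int4 RMS error: {rotated:.4f}")
```

On this toy tensor the rotated path lands at a fraction of the plain error; the mechanism, not the exact numbers, is the point.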
- On Llama 3.1 8B, `saw4k` roughly halves the quality hit versus `q4k` at similar KV savings, which is the kind of delta that matters in production
- `saw8k` and `saw8kv` are close to lossless across several models, so 8-bit SAW looks like the low-risk choice when memory pressure is moderate
- Qwen2.5 7B exposes a hard failure mode for 4-bit KV quantization: its K projection bias makes int4 quantization noise swamp the signal (a toy reproduction follows this list)
- Gemma 4 shows that architecture matters more than slogans: some presets OOM or produce garbage PPL, so these methods are not universally portable
- The benchmark makes a strong case for treating KV quantization as model-specific systems work, not a one-size-fits-all compression trick
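The Qwen2.5 bullet is easy to reproduce in miniature. Qwen's attention applies a bias on the K projection, and a per-channel offset widens every row's dynamic range in a way a per-row asymmetric zero point can't absorb, so the 4-bit grid spends its 16 levels spanning the offsets rather than the signal. A toy illustration; the bias magnitude is invented, not Qwen's actual weights:

```python
import numpy as np

def int4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Asymmetric int4 quantize + dequantize, one scale/zero point per row."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    return np.clip(np.round((x - lo) / scale), 0, 15) * scale + lo

def quant_noise(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean((int4_roundtrip(x) - x) ** 2)))

rng = np.random.default_rng(0)
signal = rng.standard_normal((32, 128)).astype(np.float32)
bias = (rng.standard_normal(128) * 30.0).astype(np.float32)  # hypothetical K bias

# A per-row zero point absorbs a uniform shift, but a *per-channel* bias
# stretches each row's min..max range, inflating the quantization step.
print(f"signal RMS:            {np.sqrt(np.mean(signal**2)):.3f}")
print(f"int4 noise, no bias:   {quant_noise(signal):.3f}")
print(f"int4 noise, with bias: {quant_noise(signal + bias):.3f}")
```

With the bias in place, the quantization noise lands well above the signal RMS, which matches the benchmark's "noise swamps the signal" description; handling the bias separately (or rotating first) would be the obvious mitigation to try.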
// TAGS
kv-cache · quantization · benchmark · inference · long-context · llm · open-source
DISCOVERED
2026-05-02
PUBLISHED
2026-05-01
RELEVANCE
8 / 10
AUTHOR
fredandlunchbox