llama.cpp asymmetric KV cache halves cache VRAM
A community evaluation found that pairing an 8-bit key cache with a 4-bit value cache in llama.cpp cuts KV cache memory usage in half for only a 1.3% precision loss. Developers are pushing to include this asymmetric configuration in default CUDA builds to prevent slow CPU fallbacks during prompt processing.
The saving matters most to developers trying to squeeze large-context models onto consumer GPUs.
- High-precision keys (q8_0) preserve attention accuracy, while values tolerate heavy 4-bit quantization (q4_0)
- Mixing `-ctk q8_0 -ctv q4_0` currently triggers a slow CPU fallback unless llama.cpp is manually compiled with the exhaustive `FA_ALL_QUANTS` flag
- Adding this specific combo to default builds would keep prompt processing on the GPU out of the box
- Asymmetric KV quantization is rapidly becoming the standard trick for maximizing context lengths on local hardware
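To make the memory claim concrete, here is a rough back-of-the-envelope sketch of KV cache size under different cache types. It relies on ggml's block layouts, where q8_0 and q4_0 each store 32 values with a 2-byte fp16 scale (34 and 18 bytes per block respectively). The model dimensions (32 layers, 8 KV heads, head dimension 128, 32k context) and the function name `kv_cache_bytes` are hypothetical, picked only for illustration, and the source does not state which baseline the evaluation compared against.

```python
# Rough KV cache size estimate; a sketch, not llama.cpp's actual allocator.
BLOCK_SIZE = 32  # values per quantization block in ggml's q8_0 / q4_0

# Bytes needed to store one block of 32 values in each cache type.
BYTES_PER_BLOCK = {
    "f16": 2 * BLOCK_SIZE,        # 64: plain 16-bit floats, no scale
    "q8_0": 2 + BLOCK_SIZE,       # 34: fp16 scale + 32 int8 values
    "q4_0": 2 + BLOCK_SIZE // 2,  # 18: fp16 scale + 32 packed 4-bit values
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Estimate total bytes for the K and V caches (ignores padding/overhead)."""
    # Number of elements in the K cache; the V cache has the same count.
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    if elems % BLOCK_SIZE != 0:
        raise ValueError("element count must be a multiple of the block size")
    blocks = elems // BLOCK_SIZE
    return blocks * (BYTES_PER_BLOCK[type_k] + BYTES_PER_BLOCK[type_v])


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)
    baseline = kv_cache_bytes(**shape, type_k="f16", type_v="f16")
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0")]:
        size = kv_cache_bytes(**shape, type_k=type_k, type_v=type_v)
        print(f"K={type_k:5} V={type_v:5} {size / 2**30:.3f} GiB "
              f"({size / baseline:.1%} of f16)")
```

Under these assumptions the mixed q8_0/q4_0 cache averages 6.5 bits per element, so it comes in well under half the size of an f16 cache and about three quarters the size of a symmetric q8_0 cache. Real allocations may differ because of padding and per-backend overhead.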
DISCOVERED
2026-05-22
PUBLISHED
2026-05-22
AUTHOR
Ueberlord