llama.cpp hits 3-bit KV cache via TurboQuant
Google Research's TurboQuant algorithm has been integrated into llama.cpp, enabling roughly 6x KV cache compression down to 3.25 bits per value with near-lossless accuracy. The change allows 30B-parameter models such as Nemotron to reach 17 tokens/sec on consumer 8GB GPUs.
TurboQuant reduces quantization overhead by applying Walsh-Hadamard rotations to the key and value vectors, producing approximately Gaussian coordinate distributions that fixed Lloyd-Max quantizers can encode near-optimally. This lets 30B+ models fit comfortably in 8GB of VRAM alongside full 8k+ context windows, something previously achievable only with severe speed degradation. The algorithm combines two stages, PolarQuant and QJL, and preserves attention precision significantly better than standard 4-bit INT quantization while easing the memory-bandwidth bottleneck. The result is up to an 8x speedup in attention computation, and because the method is model-agnostic it applies directly to any Transformer architecture, including Llama, Mistral, and Gemma.
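The core rotate-then-quantize idea can be illustrated in a few lines. This is a minimal NumPy sketch, not llama.cpp's implementation: it substitutes a plain uniform scalar quantizer for TurboQuant's Lloyd-Max codebooks, and all function names here are hypothetical. It shows why the rotation helps: a single outlier that would dominate a direct quantizer's dynamic range gets spread evenly across all coordinates by the Hadamard transform, so a low-bit quantizer loses far less precision.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two.
    The orthonormal Hadamard matrix is its own inverse, so applying fwht twice
    recovers the input."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_dequantize(v, bits):
    """Uniform min-max scalar quantizer (a stand-in for Lloyd-Max codebooks)."""
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((v - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
d = 256
v = rng.standard_normal(d)
v[0] = 100.0  # one outlier dominates the dynamic range, as KV activations often do

# Randomized rotation: random sign flip, then Hadamard transform. The rotated
# coordinates concentrate around a predictable Gaussian-like distribution.
signs = rng.choice([-1.0, 1.0], size=d)
rotated = fwht(v * signs)

# Quantize in the rotated space at 3 bits, then invert the rotation.
recon = fwht(quantize_dequantize(rotated, bits=3)) * signs

err_direct = np.linalg.norm(v - quantize_dequantize(v, bits=3)) / np.linalg.norm(v)
err_rotated = np.linalg.norm(v - recon) / np.linalg.norm(v)
print(f"relative error, direct 3-bit quantization: {err_direct:.3f}")
print(f"relative error, rotate-then-quantize:      {err_rotated:.3f}")
```

Because the rotation is orthonormal, it adds no distortion of its own; it only reshapes the distribution the quantizer sees, which is what makes a fixed low-bit codebook viable.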
DISCOVERED: 2026-04-01
PUBLISHED: 2026-03-31
AUTHOR: kvatrovit