llama.cpp hits 3-bit KV cache via TurboQuant
REDDIT · 11d ago · OPEN-SOURCE RELEASE

Google Research's TurboQuant algorithm has been integrated into llama.cpp, enabling roughly 6x KV cache compression at 3.25 bits per value with zero reported accuracy loss. The change lets 30B-parameter models such as Nemotron reach 17 tokens/sec on consumer 8 GB GPUs.
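To see why 3.25 bits matters at this scale, a back-of-envelope sizing helps. The architecture numbers below (48 layers, 8 KV heads, head dimension 128, 8k context) are hypothetical stand-ins for a GQA 30B-class model, not Nemotron's actual config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits):
    """Total bytes for the K and V caches across all layers."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # K and V
    return elems * bits / 8

fp16 = kv_cache_bytes(48, 8, 128, 8192, bits=16)
tq   = kv_cache_bytes(48, 8, 128, 8192, bits=3.25)

print(f"fp16:  {fp16 / 2**20:.0f} MiB")   # fp16:  1536 MiB
print(f"3.25b: {tq / 2**20:.0f} MiB")     # 3.25b: 312 MiB
print(f"ratio: {fp16 / tq:.2f}x")         # ratio: 4.92x
```

At these (assumed) dimensions, the cache drops from about 1.5 GiB to about 312 MiB, which is the difference between an 8 GB card spilling to system RAM and holding the full context on-device.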

// ANALYSIS

TurboQuant eliminates quantization overhead by applying Walsh-Hadamard rotations, which reshape key and value vectors into mathematically predictable, near-Gaussian distributions that Lloyd-Max quantizers handle near-optimally. This lets 30B+ models fit comfortably in 8 GB of VRAM alongside full 8k+ context windows, a feat previously impossible without severe speed degradation. The algorithm combines two stages, PolarQuant and QJL, and preserves attention precision significantly better than standard 4-bit INT quantization while easing memory-bandwidth bottlenecks. The result is up to an 8x speedup in attention computation, and the method is model-agnostic: it applies to any Transformer architecture, including Llama, Mistral, and Gemma.
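The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is an illustration only, not llama.cpp's actual kernel: the random sign flip, the 2-bit Lloyd-Max codebook for a standard normal source, and the per-vector scale are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis (power-of-2 length)."""
    x = x.astype(np.float64)
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

# Optimal 2-bit Lloyd-Max reconstruction levels for a standard normal source.
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])

def quantize(v, signs):
    """Rotate (random signs + Hadamard), scale to unit variance, snap to codebook."""
    r = fwht(v * signs)
    scale = r.std() + 1e-12
    idx = np.abs(r[..., None] / scale - LEVELS).argmin(-1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale, signs):
    r = LEVELS[idx] * scale
    return fwht(r) * signs  # the orthonormal FWHT is its own inverse

d = 128
signs = rng.choice([-1.0, 1.0], size=d)
v = rng.standard_normal(d) * 3.0        # stand-in for one KV cache vector
idx, scale = quantize(v, signs)
v_hat = dequantize(idx, scale, signs)
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative error at 2 bits: {err:.3f}")
```

The key point the sketch illustrates: because the rotated coordinates are near-Gaussian regardless of the input distribution, a single fixed codebook tuned for a normal source works for every vector, so no per-tensor calibration is needed.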

// TAGS
turboquant, llama-cpp, llm, inference, gpu, open-source

DISCOVERED

2026-04-01

PUBLISHED

2026-03-31

RELEVANCE

10/10

AUTHOR

kvatrovit