Google drops TurboQuant for extreme LLM compression
TurboQuant is a new vector quantization algorithm from Google Research that enables 3-bit KV cache compression for LLMs with near-zero accuracy loss. By combining PolarQuant for MSE optimization and 1-bit QJL for unbiased inner product estimation, it achieves up to 8x performance gains in attention computation on H100 GPUs.
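The 1-bit QJL idea mentioned above can be sketched in a few lines of numpy. This is a minimal illustration of the underlying estimator from the QJL literature, not TurboQuant's GPU kernel: for a Gaussian projection `g`, `E[sign(g·k)(g·q)] = sqrt(2/pi)·<k,q>/||k||`, so keeping only the sign bit of each projected key still yields an unbiased inner-product estimate after rescaling. The vectors `k` and `q` here are illustrative stand-ins for a cached key and a query.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 32, 200_000  # m is large only to make unbiasedness visible

k = rng.standard_normal(d)  # stand-in for a cached key vector
q = rng.standard_normal(d)  # stand-in for an incoming query

S = rng.standard_normal((m, d))  # Gaussian JL projection
k_bits = np.sign(S @ k)          # store only 1 bit per projection of k

# E[sign(g.k) * (g.q)] = sqrt(2/pi) * <k, q> / ||k|| for Gaussian g,
# so rescaling by ||k|| * sqrt(pi/2) gives an unbiased estimate.
est = np.linalg.norm(k) * np.sqrt(np.pi / 2) * np.mean(k_bits * (S @ q))
print(est, k @ q)  # the estimate concentrates around the true inner product
```

In practice only `||k||` and the sign bits need to be stored per key, which is where the memory savings come from; the estimator's variance shrinks as the number of projections grows.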
TurboQuant redefines the Pareto frontier for LLM efficiency, making massive context windows viable on memory-constrained hardware without the usual accuracy trade-offs. PolarQuant applies random rotations that induce a concentrated Beta distribution over coordinates, enabling near-optimal scalar quantization, while a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform yields unbiased inner-product estimates for similarity search. Because the design is data-oblivious, it integrates cleanly into GPU kernels, staying quality-neutral down to 3 bits per channel while cutting memory footprint by 6x and significantly outperforming existing product quantization methods.
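The rotate-then-quantize pipeline can be sketched as follows. This is a simplified numpy illustration, not Google's implementation: a random orthogonal rotation (here built via QR decomposition of a Gaussian matrix, one common construction) is applied before 3-bit uniform scalar quantization, and because the rotation is orthogonal, the distortion paid in the rotated domain carries back to the original basis unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 3
levels = 2 ** bits  # 3 bits -> 8 scalar levels per channel

x = rng.standard_normal(d)  # stand-in for one KV-cache vector

# Data-oblivious random rotation: QR of a Gaussian matrix yields an
# orthogonal Q independent of the data being quantized.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def scalar_quantize(v, levels):
    """Uniform scalar quantization over the vector's own range."""
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return lo + step * np.round((v - lo) / step), step

x_rot = Q @ x
q_rot, step = scalar_quantize(x_rot, levels)
x_hat = Q.T @ q_rot  # rotate back to the original basis

# Orthogonal rotations preserve Euclidean distortion exactly, so the
# end-to-end MSE equals the MSE incurred in the rotated domain, and
# each rotated coordinate errs by at most half a quantization step.
mse = np.mean((x_hat - x) ** 2)
print(mse)
```

The rotation costs nothing in distortion (it is exactly invertible) but reshapes the coordinate distribution into the concentrated form for which uniform scalar quantization is near-optimal; this data-obliviousness is what lets the transform be fused into a GPU kernel.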
DISCOVERED: 2026-03-25
PUBLISHED: 2026-03-24
AUTHOR: burnqubic