OPEN_SOURCE
REDDIT // RESEARCH PAPER
Google TurboQuant claims 6x KV compression
Google Research’s TurboQuant is a new vector-quantization scheme aimed at shrinking KV caches and speeding up long-context inference. Google says it can cut KV memory by at least 6x and accelerate attention by up to 8x on H100s, while the paper reports near-baseline accuracy at 3.5 bits per channel.
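To make the memory claim concrete, here is a minimal sketch of low-bit KV-cache quantization. This is plain uniform per-channel scalar quantization, a deliberately simpler stand-in: TurboQuant itself is a vector-quantization scheme, and the function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def quantize_per_channel(kv, bits):
    """Uniform per-channel quantization of a KV-cache slice.

    Hypothetical sketch: kv is a (tokens, channels) float32 array;
    returns integer codes plus per-channel scale/offset for dequantization.
    """
    levels = 2 ** bits - 1
    lo = kv.min(axis=0)                       # per-channel minimum
    scale = (kv.max(axis=0) - lo) / levels    # per-channel step size
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    codes = np.round((kv - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    # Reconstruct approximate values from codes and per-channel metadata.
    return codes * scale + lo

kv = np.random.randn(128, 64).astype(np.float32)
codes, scale, lo = quantize_per_channel(kv, bits=4)
recon = dequantize(codes, scale, lo)

# Round-trip error is bounded by half a quantization step per channel.
assert np.abs(kv - recon).max() <= scale.max()
# A 16-bit cache stored at 4 bits is a 4x raw reduction, before the
# per-channel scale/offset metadata and packing overhead are counted.
```

The 6x figure Google quotes presumably comes from going below 4 bits (the paper reports 3.5 and 2.5 bits per channel) plus whatever the vector-quantization codebook buys over a scalar scheme like this one.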
// ANALYSIS
The math looks real; the systems question is whether the compression survives the implementation tax.
- The paper reports near-baseline LongBench and needle-in-a-haystack results on Llama-3.1-8B-Instruct, with 3.5-bit TurboQuant matching the full-cache average score and 2.5-bit staying close.
- Google’s blog headline numbers are strong, but the paper also describes a mixed-precision fused kernel that is 2-4x faster than conventional floating-point GEMM, so the end-to-end gain depends on how cleanly that kernel integrates into a serving stack.
- The only concrete outside-paper implementation I found is an MLX port on Llama-3.2-3B claiming a 41.8% total KV-footprint reduction and 0.01s hot-swap latency, while noting that bit-packing/unpacking is the current bottleneck.
- That makes TurboQuant especially interesting for local and edge inference with tight VRAM budgets; for production, the next proof point is a clean CUDA or Metal implementation that keeps the speedup after integration.
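The bit-packing bottleneck the MLX port mentions is easy to see in miniature: packing two 4-bit codes per byte halves storage, but every read then pays an unpack step on the hot path. A minimal sketch (illustrative helpers, not the MLX port’s actual code):

```python
import numpy as np

def pack_4bit(codes):
    """Pack pairs of 4-bit codes (values 0..15) into single bytes."""
    flat = codes.reshape(-1)
    assert flat.size % 2 == 0, "need an even number of 4-bit codes"
    return (flat[0::2] << 4) | flat[1::2]

def unpack_4bit(packed):
    """Split each byte back into its high and low 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4    # high nibble
    out[1::2] = packed & 0x0F  # low nibble
    return out

codes = np.random.randint(0, 16, size=256, dtype=np.uint8)
packed = pack_4bit(codes)

assert packed.nbytes == codes.nbytes // 2          # 2x storage saving
assert np.array_equal(unpack_4bit(packed), codes)  # lossless round-trip
```

On a GPU or Metal backend, that shift-and-mask work runs on every attention read, which is presumably why a fused kernel that dequantizes in registers, rather than a separate unpack pass, is the proof point to watch.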
// TAGS
llm · inference · gpu · benchmark · research · turboquant
DISCOVERED
2026-03-25
PUBLISHED
2026-03-25
RELEVANCE
9/10
AUTHOR
SelectionCalm70