Google TurboQuant claims 6x KV compression
REDDIT // 17d ago // RESEARCH PAPER

Google Research’s TurboQuant is a new vector-quantization scheme aimed at shrinking KV caches and speeding up long-context inference. Google says it cuts KV memory by at least 6x and speeds attention by up to 8x on H100s, while the paper reports near-baseline accuracy at 3.5 bits per channel.
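To ground the headline numbers, here is a back-of-envelope per-token KV footprint under the published Llama-3.1-8B dimensions (32 layers, 8 KV heads via GQA, head_dim 128); treat the figures as an illustrative sketch, since quantizer metadata such as scales and zero-points is ignored.

```python
# Per-token KV-cache footprint sketch. Model dimensions below are the
# published Llama-3.1-8B config (32 layers, 8 KV heads, head_dim 128);
# treat them as assumptions here, and note that quantizer metadata
# (scales/zero-points) is not counted.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
channels = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM   # K and V planes
bytes_fp16 = channels * 2           # 16-bit baseline
bytes_35bit = channels * 3.5 / 8    # 3.5 bits per channel
print(f"fp16: {bytes_fp16} B/token, 3.5-bit: {bytes_35bit:.0f} B/token, "
      f"ratio: {bytes_fp16 / bytes_35bit:.2f}x")
# Raw bit-width ratio is 16/3.5 ~= 4.57x; the blog's 6x headline presumably
# measures against a different baseline or bundles extra savings (assumption).
```

The raw bit-width ratio alone does not reach 6x from a 16-bit baseline, which is worth keeping in mind when reading the headline claim.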

// ANALYSIS

The math looks real; the systems question is whether the compression survives the implementation tax.

  • The paper reports near-baseline LongBench and needle-in-a-haystack results on Llama-3.1-8B-Instruct, with 3.5-bit TurboQuant matching the full-cache average score and 2.5-bit staying close.
  • Google’s blog headline numbers are strong, but the paper itself describes a mixed-precision fused kernel that is 2-4x faster than conventional floating-point GEMM, so the actual end-to-end gain depends on how well that kernel gets fused into a serving stack.
  • The only concrete outside-paper implementation I found is an MLX port on Llama-3.2-3B claiming a 41.8% total KV-footprint reduction and 0.01 s hot-swap latency, while noting that bit-packing/unpacking is the current bottleneck.
  • That makes TurboQuant especially interesting for local and edge inference with tight VRAM budgets; for production, the next proof point is a clean CUDA or Metal implementation that keeps the speedup after integration.
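Since the MLX port flags bit-packing/unpacking as its bottleneck, a minimal NumPy sketch of that step helps show why: every read of the cache has to shift and mask codes back out of bytes. This is generic per-channel uniform quantization with 4-bit nibble packing, not TurboQuant's actual vector-quantization scheme; all function names are hypothetical.

```python
import numpy as np

def quantize_per_channel(x, bits=4):
    # x: (tokens, channels). Uniform per-channel min/max quantization.
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def pack_nibbles(q):
    # Two 4-bit codes per byte (assumes an even number of elements).
    flat = q.reshape(-1)
    return (flat[0::2] << 4) | flat[1::2]

def unpack_nibbles(packed, shape):
    # The shift-and-mask here is the per-read cost the MLX port complains about.
    flat = np.empty(packed.size * 2, dtype=np.uint8)
    flat[0::2] = packed >> 4
    flat[1::2] = packed & 0x0F
    return flat.reshape(shape)

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

x = np.random.default_rng(0).standard_normal((128, 64)).astype(np.float32)
q, lo, scale = quantize_per_channel(x, bits=4)
packed = pack_nibbles(q)                      # 4096 B vs 32768 B fp32
recon = dequantize(unpack_nibbles(packed, q.shape), lo, scale)
```

Reconstruction error is bounded by half a quantization step per channel; a production kernel would fuse the unpack into the attention GEMM rather than materializing `recon`, which is exactly the integration question raised above.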
// TAGS
llm · inference · gpu · benchmark · research · turboquant

DISCOVERED

2026-03-25 (17d ago)

PUBLISHED

2026-03-25 (17d ago)

RELEVANCE

9 / 10

AUTHOR

SelectionCalm70