OPEN_SOURCE ↗
REDDIT · RESEARCH PAPER
TurboQuant rotates vectors, trims quantization bias
TurboQuant is Google's vector-quantization scheme for compressing KV caches and search embeddings. It randomly rotates vectors before low-bit quantization, then adds a 1-bit residual step to keep attention inner products unbiased and distortion near-optimal.
// ANALYSIS
TurboQuant is more than a “polar coordinates” trick; the real insight is a two-stage quantizer that makes low-bit compression behave well on spiky transformer activations while preserving the inner products that attention depends on.
- Random rotation spreads signal across coordinates, so scalar quantization stops snapping quasi-sparse vectors onto a single dominant axis.
- The 1-bit QJL residual is the underrated piece, because MSE-optimal quantizers can distort dot products, and attention is built on dot products.
- The paper points at inference infrastructure, not training: KV-cache compression and vector search are where training-free compression pays off.
- Reported gains like 3.5 bits per channel with no quality loss, plus up to 8x attention-logit speedup on H100s, make it a serious systems paper rather than a curiosity.
- The Google blog's polar-coordinate framing is a useful intuition, but it can overstate the geometry and understate the bias-correction step.
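The two-stage idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's construction: TurboQuant uses fast structured rotations rather than a dense QR-based one, and its residual coding is more careful than the simple sign-plus-mean-magnitude step used here. All helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR of a Gaussian.
    # (Illustrative; the paper uses fast structured rotations.)
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def scalar_quantize(x, bits):
    # Uniform low-bit scalar quantization of a rotated vector.
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

def two_stage_quantize(x_rotated, bits=3):
    # Stage 1: low-bit scalar quantization after rotation.
    q1 = scalar_quantize(x_rotated, bits)
    # Stage 2: 1-bit residual (sign plus a shared magnitude),
    # a crude stand-in for the QJL-style correction that nudges
    # reconstructed inner products back toward the true values.
    r = x_rotated - q1
    q2 = np.sign(r) * np.abs(r).mean()
    return q1 + q2

d = 256
rot = random_rotation(d)
x, y = rng.normal(size=d), rng.normal(size=d)
# Inner products are invariant under a shared rotation, so we can
# compare the quantized dot product against the exact one directly.
approx = two_stage_quantize(rot @ x) @ (rot @ y)
exact = x @ y
```

Even this toy shows why the rotation matters: after rotating, every coordinate looks roughly Gaussian, so a uniform grid wastes no levels on outlier axes, and the 1-bit residual shrinks the per-coordinate error that would otherwise accumulate in dot products.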
// TAGS
turboquant · llm · inference · search · vector-db · research
DISCOVERED
14d ago
2026-03-28
PUBLISHED
14d ago
2026-03-28
RELEVANCE
9/10
AUTHOR
-p-e-w-