TurboQuant lands on Android, cuts KV cache
A Reddit user says they built an Android KV-cache compression stack around Google Research’s TurboQuant ideas, combining PolarQuant-style rotations, Lloyd-Max quantization, compressed attention, and optional QJL residuals. The result is reportedly a 4-5x cache reduction versus FP16 while still running on mid-range phones and older 32-bit devices.
The useful insight here is that KV cache, not weights, is often the real limiter for on-device LLMs once you get past toy contexts. TurboQuant looks promising because it attacks that runtime memory growth directly, and the Android port suggests the method may matter more in constrained deployment than in datacenter benchmarks.
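To see why KV cache rather than weights becomes the limiter, a back-of-envelope calculation helps. The model shape below is a hypothetical mid-size example (not taken from the post), but the arithmetic is the standard formula: two tensors (K and V) per layer, growing linearly with context length.

```python
# Back-of-envelope KV-cache footprint; model dimensions are
# illustrative assumptions, not the configuration from the post.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    # 2x for keys and values; bits/8 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

fp16 = kv_cache_bytes(28, 8, 128, 8192, 16)
q4 = kv_cache_bytes(28, 8, 128, 8192, 4)
print(fp16 // 2**20, "MiB at fp16")  # 896 MiB
print(fp16 // q4, "x reduction at 4-bit")  # 4x
```

At an 8K context this hypothetical model already spends close to a gigabyte on cache in FP16, which is why a 4-5x cut matters far more on a phone than shaving weight size would.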
- 3-bit and 4-bit KV compression sit in the right tradeoff zone for mobile: 3-bit maximizes memory headroom, while 4-bit may be the safer default when attention quality matters more than raw savings
- Compressed attention without full dequantization is the most interesting part of the implementation, because it removes one of the usual latency penalties of KV quantization
- The scalar 32-bit fallback matters: supporting older ARM devices broadens the practical reach well beyond high-end phones
- The real competitive question is whether QJL-style residuals are the best long-context fix, or whether simpler asymmetric K/V schemes and sparsity tricks will be easier to ship
- If this gets open sourced, it could become a useful reference for mobile inference stacks that need long context without exhausting RAM
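The Lloyd-Max quantizer mentioned above is a classical technique: it fits a non-uniform codebook to the data distribution by alternating nearest-centroid assignment with centroid re-estimation (equivalent to 1-D k-means). This is a generic sketch of that algorithm, not the poster's implementation.

```python
import random

def lloyd_max(samples, bits, iters=20):
    """Generic 1-D Lloyd-Max quantizer: alternate between
    nearest-centroid assignment and centroid recomputation."""
    levels = 2 ** bits
    lo, hi = min(samples), max(samples)
    # initialize centroids uniformly across the data range
    centroids = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # decision boundaries are midpoints between adjacent centroids
        bounds = [(centroids[i] + centroids[i + 1]) / 2
                  for i in range(levels - 1)]
        buckets = [[] for _ in range(levels)]
        for x in samples:
            buckets[sum(x > b for b in bounds)].append(x)
        # move each centroid to the mean of its bucket
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(4000)]
codebook = lloyd_max(data, bits=3)  # 8 reconstruction levels
```

Because KV activations are far from uniform, this data-fitted codebook typically wastes fewer of the scarce 3-bit or 4-bit levels than uniform quantization would.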
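On the compressed-attention point: with a linear (scale/zero-point) quantizer, attention scores can be computed directly on the integer codes, because q·(s·(c − z)) = s·(q·c) − s·z·Σq. The sketch below illustrates that identity with a simple asymmetric per-row scheme; it is an assumption-laden toy, not TurboQuant's actual kernel.

```python
def quantize_row(k, bits=4):
    # hypothetical asymmetric per-row scheme: k ≈ scale * (code - zero)
    lo, hi = min(k), max(k)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid zero scale
    zero = -lo / scale
    codes = [round(x / scale + zero) for x in k]
    return codes, scale, zero

def score(q, codes, scale, zero):
    # q . k recovered from integer codes without per-element dequant:
    # q . (scale*(c - zero)) = scale*(q . c) - scale*zero*sum(q)
    qc = sum(a * b for a, b in zip(q, codes))
    return scale * qc - scale * zero * sum(q)
```

Since Σq is computed once per query, the inner loop is an integer dot product over the codes, which is the latency win the bullet above refers to.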
DISCOVERED: 2026-03-31
PUBLISHED: 2026-03-31
AUTHOR: realaneesani