TurboQuant nabs 34 tok/s for 30B model on Mac
Google Research's TurboQuant algorithm enables 3-bit weight compression and fast inference on Apple Silicon via custom Metal kernels. It delivers a 42x speedup over fallbacks while maintaining significantly higher accuracy than standard 3-bit quantization.
TurboQuant is a meaningful unlock for running large models on consumer hardware, easing the memory bottleneck in long-context sessions. Achieving 34 tok/s on a 30B model on a 48GB Mac puts flagship-level coding capabilities within reach of local developers. The scalar HIGGS-style algorithm's 3-bit compression eliminates the need for tedious calibration datasets, and the performance gains over MLX's native quantization show that theoretical rigor in kernel design pays real dividends. While it excels at single-user decode, the current implementation's "dequant-per-forward tax" on prefill remains a target for future optimization.
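To make the calibration-free idea concrete, here is a minimal sketch of rotation-based 3-bit scalar quantization. This is a hypothetical illustration, not TurboQuant's or HIGGS's actual implementation: it uses a random orthogonal rotation (a stand-in for the fast Hadamard transforms such methods use) to make weight groups approximately Gaussian, then quantizes each group to 8 levels with a per-group scale, with no calibration data involved.

```python
import numpy as np

def quantize_3bit(w, group_size=32, seed=0):
    """Calibration-free 3-bit scalar quantization (illustrative sketch).

    Rotate weights so they look Gaussian, then round each group to
    8 uniform levels (3 bits) using only a per-group scale.
    """
    rng = np.random.default_rng(seed)
    # Random orthogonal rotation via QR decomposition; real systems use a
    # fast Hadamard transform for O(n log n) cost instead.
    q, _ = np.linalg.qr(rng.standard_normal((group_size, group_size)))
    groups = w.reshape(-1, group_size) @ q.T
    # Per-group scale chosen so the max value maps near the top code.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 3.5
    codes = np.clip(np.round(groups / scale), -4, 3).astype(np.int8)  # 8 levels
    return codes, scale, q

def dequantize_3bit(codes, scale, q):
    # Undoing the scaling and rotation on every forward pass is the
    # "dequant-per-forward tax" the article mentions for prefill.
    return (codes.astype(np.float32) * scale) @ q

w = np.random.default_rng(1).standard_normal(4 * 32).astype(np.float32)
codes, scale, q = quantize_3bit(w)
w_hat = dequantize_3bit(codes, scale, q).reshape(-1)
mean_abs_err = np.abs(w - w_hat).mean()
```

Because the rotation is orthogonal, dequantization error is bounded by the rounding step alone, which is why this family of methods holds up at 3 bits without per-model calibration.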
DISCOVERED
3h ago
2026-04-19
PUBLISHED
6h ago
2026-04-18
AUTHOR
Varjoranta