OPEN_SOURCE
REDDIT // 3h ago // RESEARCH PAPER
TurboQuant-H squishes Gemma 4 embeddings to 2-bit
Cactus Compute has introduced TurboQuant-H, a 2-bit quantization technique for embedding layers, optimized specifically for Gemma 4's "AltUp" architecture. By combining Hadamard rotations with Lloyd-Max codebooks, it shrinks a 5B-parameter model from 4.8GB to 2.9GB with negligible perplexity loss, enabling capable on-device AI on mobile hardware with only 4GB of RAM.
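The two ingredients named above compose naturally: rotate each embedding row with an orthonormal Hadamard matrix to spread outliers across dimensions, then fit a 4-level (2-bit) Lloyd-Max scalar codebook to the rotated values. Below is a minimal NumPy sketch of that pipeline under standard textbook formulations; all function names are illustrative, not Cactus Compute's actual API, and details like per-row vs. global codebooks are assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction orthonormal Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(samples, bits=2, iters=50):
    """Fit a 2**bits-level scalar codebook with Lloyd's algorithm."""
    levels = 2 ** bits
    # Initialize codewords at evenly spaced quantiles of the data.
    cb = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Nearest-codeword assignment, then centroid update per cell.
        idx = np.abs(samples[:, None] - cb[None, :]).argmin(axis=1)
        for k in range(levels):
            sel = idx == k
            if sel.any():  # leave empty cells where they are
                cb[k] = samples[sel].mean()
    return cb

def quantize_embeddings(E, bits=2):
    """Rotate rows, then map each value to a 2-bit codebook index."""
    H = hadamard(E.shape[1])
    R = E @ H                      # Hadamard rotation, applied per row
    cb = lloyd_max_codebook(R.ravel(), bits)
    idx = np.abs(R.ravel()[:, None] - cb[None, :]).argmin(axis=1)
    return idx.reshape(E.shape).astype(np.uint8), cb, H

def dequantize_embeddings(idx, cb, H):
    """Look up codewords, then undo the rotation (orthonormal: inverse = transpose)."""
    return cb[idx] @ H.T
```

After the rotation the per-coordinate distribution is close to Gaussian, which is exactly the regime where a Lloyd-Max scalar quantizer is near-optimal; that is what the rotation is buying.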
// ANALYSIS
2-bit embeddings are a major win for on-device LLMs where bloated embedding tables often bottleneck deployment on consumer hardware.
- Deterministic Hadamard rotations simplify the quantization pipeline compared to random orthogonal methods.
- Achieves a 40% reduction in total model weight for Gemma 4 E2B with only a 0.06 increase in perplexity.
- Enables large, reasoning-capable models to fit within the memory constraints of standard mobile and wearable devices.
- No measured inference speed regression: butterfly factorization keeps the rotation cheap, and the bandwidth saved by 2-bit weights offsets its cost.
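The speed claim in the last bullet rests on the fact that a Hadamard rotation need not be a dense matrix multiply: a butterfly factorization applies it in O(n log n) additions and subtractions, the fast Walsh-Hadamard transform. A minimal sketch of that butterfly structure (an assumed textbook form, not the TurboQuant-H kernel):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform over the last axis via log2(n) butterfly stages."""
    x = np.atleast_2d(np.asarray(x, dtype=np.float64)).copy()
    n = x.shape[1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Each stage pairs blocks of size h and replaces them with sum/difference.
        for i in range(0, n, 2 * h):
            a = x[:, i:i + h].copy()
            b = x[:, i + h:i + 2 * h].copy()
            x[:, i:i + h] = a + b
            x[:, i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling
```

Each of the log2(n) stages touches every element once, so the rotation costs n·log2(n) adds per row instead of the n² multiplies of an explicit matrix product.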
// TAGS
llm · embedding · edge-ai · open-source · turboquant-h
DISCOVERED
3h ago
2026-04-22
PUBLISHED
4h ago
2026-04-22
RELEVANCE
8/10
AUTHOR
Henrie_the_dreamer