TurboQuant-H squishes Gemma 4 embeddings to 2-bit
OPEN_SOURCE
REDDIT · 3h ago · RESEARCH PAPER


Cactus Compute has introduced TurboQuant-H, a 2-bit quantization technique for embedding layers, optimized for Gemma 4's "AltUp" architecture. By combining Hadamard rotations with Lloyd-Max codebooks, it shrinks a 5B-parameter model from 4.8 GB to 2.9 GB with negligible perplexity loss, enabling sophisticated on-device AI on mobile hardware with 4 GB of RAM.
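The core recipe named in the summary, rotate with a Hadamard transform so embedding rows look near-Gaussian, then fit a 4-level (2-bit) Lloyd-Max codebook, can be sketched as follows. This is a minimal illustration, not TurboQuant-H itself: the per-row codebook layout, the iteration counts, and the assumption that the embedding dimension is a power of two are all simplifications for the sketch.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H @ H.T = I

def lloyd_max(x, bits=2, iters=50):
    # Scalar Lloyd-Max: 1-D k-means with k = 2**bits levels.
    levels = np.quantile(x, np.linspace(0.1, 0.9, 2**bits))
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):            # guard against empty clusters
                levels[k] = x[idx == k].mean()
    return levels, idx

def quantize_embeddings(W, bits=2):
    # Rotate rows with a deterministic Hadamard matrix to spread
    # outliers, then quantize each rotated row to 2-bit codes.
    H = hadamard(W.shape[1])
    Wr = W @ H
    codes = np.empty(Wr.shape, dtype=np.uint8)
    books = np.empty((W.shape[0], 2**bits))
    for i, row in enumerate(Wr):
        books[i], codes[i] = lloyd_max(row, bits)
    return codes, books, H

def dequantize(codes, books, H):
    # Look up each code in its row's codebook, then undo the rotation.
    Wr = np.take_along_axis(books, codes.astype(int), axis=1)
    return Wr @ H.T
```

With 2 bits per weight, the reconstruction is lossy but close in norm; the deterministic Hadamard rotation (versus a random orthogonal matrix) is what lets the rotation be recomputed on the fly instead of stored.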

// ANALYSIS

2-bit embeddings are a major win for on-device LLMs, where bloated embedding tables often bottleneck deployment on consumer hardware.

  • Deterministic Hadamard rotations simplify the quantization pipeline compared to random orthogonal methods.
  • Achieves a 40% reduction in total model weight for Gemma 4 E2B with only a 0.06 increase in perplexity.
  • Enables large, reasoning-capable models to fit within the memory constraints of standard mobile and wearable devices.
  • No measured inference speed regression: the Hadamard rotation's butterfly factorization keeps its compute cost negligible, and the smaller table reduces memory bandwidth requirements.
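Back-of-envelope arithmetic shows why the embedding table dominates the savings. The vocabulary size and embedding dimension below are illustrative assumptions, not figures from the release:

```python
# Assumed shapes for illustration only: a 262,144-token vocabulary
# with 2048-dim embeddings (not confirmed TurboQuant-H figures).
vocab, dim = 262_144, 2048

fp16_bytes = vocab * dim * 2            # 2 bytes per weight -> 1 GiB
q2_bytes = vocab * dim // 4             # 2 bits per weight, 4 codes per byte
codebook_bytes = vocab * 4 * 2          # per-row 4-level fp16 codebook (assumed)

print(f"fp16 table:  {fp16_bytes / 2**20:.0f} MiB")
print(f"2-bit table: {(q2_bytes + codebook_bytes) / 2**20:.0f} MiB")
```

Under these assumptions the table drops from roughly 1 GiB to about 130 MiB, an ~8x shrink of one component, which is how a 4.8 GB model can land near 2.9 GB overall once the rest of the weights stay at higher precision.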
// TAGS
llm · embedding · edge-ai · open-source · turboquant-h

DISCOVERED

3h ago

2026-04-22

PUBLISHED

4h ago

2026-04-22

RELEVANCE

8/10

AUTHOR

Henrie_the_dreamer