Google releases Gemma 4 QAT checkpoints
Google DeepMind has released official Quantization-Aware Training (QAT) checkpoints for the Gemma 4 model family on Hugging Face, integrating model compression directly into the training process. The release includes unquantized checkpoints trained for Q4_0, GGUF files, a mobile-optimized wNa8o8 schema, and compressed tensors for native vLLM inference.
Post-training quantization is dead for high-stakes edge deployments; native QAT is now the baseline expectation for open-source LLM releases if developers want production-grade on-device performance without sacrificing accuracy.
- **PTQ is a compromise:** Traditional post-training quantization can erode reasoning capability, whereas QAT preserves quality by simulating the precision loss during training, so the weights adapt to rounding error before export.
- **Mobile-first architecture:** Custom mobile quantization schemas such as wNa8o8 (with 2-bit decoding layers) show that hardware-software co-design is essential for running larger models on mobile devices (e.g., shrinking Gemma 4 E2B to a 1 GB footprint).
- **Ecosystem readiness:** Shipping multiple ready-to-run formats (GGUF, compressed tensors, and Q4_0) enables immediate adoption across a fragmented local inference ecosystem (vLLM, Ollama, llama.cpp, LiteRT-LM).
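
To make "simulating precision loss during training" concrete, here is a minimal sketch of the quantize-then-dequantize ("fake quantization") step that QAT typically applies to weights in the forward pass. It assumes a simplified symmetric 4-bit scheme with blocks of 32 weights and one 16-bit scale per block, which resembles Q4_0 at a high level but is neither the llama.cpp kernel nor Google's actual training recipe. All function names are hypothetical.

```python
import random

BLOCK_SIZE = 32  # assumption: 32 weights share one scale, as in Q4_0-style layouts


def fake_quantize_block(block, bits=4):
    """Quantize one block to signed integers, then map back to floats.

    Simplified symmetric scheme for illustration only.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)       # all-zero block: nothing to quantize
    scale = amax / qmax
    out = []
    for w in block:
        q = max(qmin, min(qmax, round(w / scale)))  # round, then clamp
        out.append(q * scale)                       # dequantize
    return out


def fake_quantize(weights, bits=4, block_size=BLOCK_SIZE):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[i:i + block_size], bits))
    return out


def bits_per_weight(bits=4, block_size=BLOCK_SIZE, scale_bits=16):
    """Effective storage cost: payload bits plus the amortized scale."""
    return bits + scale_bits / block_size


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK_SIZE)]
simulated = fake_quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, simulated))
print(f"max abs error: {max_err:.6f}")
print(f"storage: {bits_per_weight()} bits per weight")
```

In a real QAT loop the forward pass would use the `simulated` weights while gradients update the full-precision originals, usually by treating the rounding step as the identity in the backward pass (the straight-through estimator). PTQ applies the same rounding only once, after training has finished, so the model never gets a chance to compensate for it.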
DISCOVERED: 2026-06-05
PUBLISHED: 2026-06-05
AUTHOR: googlegemma