Gemma 4 MTP drafters cut latency
Google is shipping Multi-Token Prediction (MTP) drafters for Gemma 4, an open-source add-on that uses speculative decoding to speed up inference by up to 3x without changing output quality. The weights are available now under Apache 2.0 on Hugging Face and Kaggle.
This is the kind of release that matters more than another benchmark chart: it attacks the real bottleneck in local and server-side LLM serving, not just the model score.
- Speculative decoding is a practical win because the large model checks several drafted tokens per forward pass, so it spends less time shuttling weights from VRAM for every generated token
- The biggest beneficiaries are workstation and edge deployments, where latency and memory bandwidth are tighter constraints
- Google is making the optimization broadly usable by supporting common stacks like Transformers, vLLM, MLX, SGLang, and Ollama
- The “no quality degradation” claim is important: developers can put otherwise idle compute to work for faster responses without changing model behavior
- This also strengthens Gemma 4’s positioning as an open-weight family meant for real deployment, not just demos
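To make the "faster, but same output" point concrete, here is a minimal toy sketch of greedy speculative decoding in Python. It is not Gemma 4's MTP implementation: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap drafter, tokens are plain integers, and the verification loop stands in for what a real system would do in one batched forward pass. Real implementations also typically gain one extra token when every draft is accepted, and use a rejection-sampling rule when sampling instead of greedy decoding; both are omitted here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic toy rule over the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap drafter: matches the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft up to k tokens, then let the target verify them in one 'pass'."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The drafter proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system would
        #    score all positions in a single batched forward pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        out.extend(accepted)
    return out, target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
fast, passes = speculative_decode(target_next, draft_next, prompt, 20)
print(fast == baseline, passes)
```

Because every kept token is the one the target itself would have chosen, the output matches plain greedy decoding exactly, while the target needs fewer verification passes than the 20 sequential steps of the baseline. How much that saves in practice depends on how often the drafter agrees with the target.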
DISCOVERED: 2026-05-05
PUBLISHED: 2026-05-05
AUTHOR: googleaidevs