X // OPEN-SOURCE RELEASE

Gemma 4 MTP drafters cut latency

Google is shipping Multi-Token Prediction (MTP) drafters for Gemma 4, an open-source add-on that uses speculative decoding to speed up inference by up to 3x without changing output quality. The weights are available now under Apache 2.0 on Hugging Face and Kaggle.
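For a rough sense of where a figure like "up to 3x" can come from, here is a back-of-the-envelope sketch using the standard speculative-decoding estimate. It is not Google's methodology. The acceptance rate `alpha`, the draft length `k`, and the relative drafter cost `draft_cost` are all hypothetical inputs, and the formula assumes each drafted token is accepted independently with the same probability.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per large-model pass.

    Assumes the drafter proposes k tokens and each one is accepted
    independently with probability alpha (a simplifying assumption).
    The pass always yields at least one token from the large model.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain one-token-per-pass decoding.

    draft_cost is the time of one drafter step divided by the time of
    one large-model pass (hypothetical; measure it on real hardware).
    """
    time_per_round = k * draft_cost + 1.0  # k drafter steps + 1 verify pass
    return expected_tokens_per_pass(alpha, k) / time_per_round


if __name__ == "__main__":
    # Illustrative inputs only, not measured Gemma 4 numbers.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

The estimate makes the trade-off visible: speedup grows with the acceptance rate and shrinks as the drafter gets more expensive relative to the large model, so real gains depend on the workload.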

// ANALYSIS

This is the kind of release that matters more than another benchmark chart: it attacks decode latency, the real bottleneck in local and server-side LLM serving, rather than the model's benchmark score.

– Speculative decoding is a practical win because decoding is memory-bandwidth-bound: the large model verifies several drafted tokens in one forward pass, so its weights are read from VRAM once per group of tokens instead of once per token
– The biggest beneficiaries are workstation and edge deployments, where latency and memory bandwidth are tighter constraints
– Google is making the optimization broadly usable by supporting common stacks like Transformers, vLLM, MLX, SGLang, and Ollama
– The “no quality degradation” claim is important: developers can spend otherwise idle compute to get faster responses without changing model behavior
– This also strengthens Gemma 4’s positioning as an open-weight family meant for real deployment, not just demos
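To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding with toy stand-in models. Everything in it is hypothetical: `target_next` and `draft_next` are arithmetic rules rather than Gemma 4 or its MTP drafters, and the verification step is emulated with a loop where a real model would use one batched forward pass. It illustrates why the output matches plain decoding while the large model runs fewer times.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next token (hypothetical rule)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except when len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def baseline_decode(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Plain greedy decoding: one large-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out, n_new


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of large-model passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. One large-model pass scores all k + 1 positions.
        #    (Emulated with a loop here; a real model batches this.)
        passes += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        # 3. Keep the longest prefix where the draft matches the large model,
        #    then append the large model's own token at the first mismatch
        #    (or its bonus token if every draft was accepted).
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        out.append(verified[n_ok])
    return out[:goal], passes


if __name__ == "__main__":
    base, base_passes = baseline_decode([1, 2, 3], 20)
    spec, spec_passes = speculative_decode([1, 2, 3], 20, k=4)
    print("identical output:", base == spec)
    print("large-model passes:", base_passes, "vs", spec_passes)
```

Because every emitted token is either confirmed or produced by the large model, a wrong draft only costs wasted drafter work, never a changed output. Sampling-based variants use a probabilistic acceptance rule instead of the exact match shown here.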

// TAGS

gemma-4, llm, open-weights, inference, open-source, gpu

DISCOVERED

2026-05-05

PUBLISHED

2026-05-05

RELEVANCE

9/10

AUTHOR

googleaidevs