Google strips MTP heads from public Gemma 4 weights
Google released Gemma 4 with Multi-Token Prediction (MTP) heads reserved exclusively for LiteRT runtimes, leaving public Hugging Face weights limited to standard autoregressive inference. This discovery has sparked debate over "open-washing" as the highest-performance version remains locked behind Google's proprietary ecosystem.
Google's decision to decouple MTP heads from public weights is a strategic move that prioritizes ecosystem control over true open-source parity. MTP enables 1.5x-2.0x faster inference through built-in speculative decoding, a major advantage for on-device AI. Stripping these heads from Hugging Face weights ensures that developers seeking maximum performance must use Google's LiteRT framework. While Google cites compatibility as the reason, the move creates a two-tier system that hinders third-party optimization in tools like llama.cpp. The community is already working on reverse-engineering the LiteRT models to stitch MTP support back into standard formats.
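The claimed 1.5x-2.0x speedup comes from the MTP heads drafting several future tokens alongside the main next-token prediction, with drafts kept only when the main head agrees. The toy sketch below illustrates that accept/reject loop; everything here (`ToyMTPModel`, the fake logits) is a hypothetical stand-in, not Gemma's actual implementation, and for clarity it re-runs the main head per draft where a real implementation would verify all drafts in one batched forward pass.

```python
def greedy_token(logits):
    """Pick the argmax token id from a list of logits."""
    return max(range(len(logits)), key=lambda i: logits[i])

class ToyMTPModel:
    """Stand-in model: a main next-token head plus k extra MTP heads
    that each draft one additional future token."""
    def __init__(self, vocab_size=8, num_mtp_heads=2):
        self.vocab_size = vocab_size
        self.num_mtp_heads = num_mtp_heads

    def forward(self, context):
        # Deterministic fake logits derived from the context;
        # a real model would run a transformer here.
        def fake_logits(seed):
            return [(seed * 31 + t) % 97 for t in range(self.vocab_size)]
        h = sum(context) + len(context)
        main = fake_logits(h)
        drafts = [fake_logits(h + 1 + k) for k in range(self.num_mtp_heads)]
        return main, drafts

def mtp_speculative_decode(model, prompt, max_new_tokens=6):
    """MTP-style self-speculative decoding: the extra heads draft tokens,
    and each draft is accepted only if the main head agrees with it."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new_tokens:
        main, drafts = model.forward(seq)
        seq.append(greedy_token(main))  # main head always emits one token
        for d in drafts:
            if len(seq) - len(prompt) >= max_new_tokens:
                break
            draft_tok = greedy_token(d)
            verify_main, _ = model.forward(seq)
            if greedy_token(verify_main) == draft_tok:
                seq.append(draft_tok)   # accepted: an (almost) free token
            else:
                break                   # rejected: resume from the main head
    return seq[len(prompt):]
```

Without the MTP heads, every rejected branch above disappears and the model is forced back to one token per forward pass, which is exactly the standard autoregressive mode the public weights are limited to.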
DISCOVERED: 2026-04-10
PUBLISHED: 2026-04-10
AUTHOR: FunSignificance4405