Gemma 4 MTP hinges on acceptance
A LocalLLaMA user benchmarked Gemma 4’s MTP drafters on MLX-VLM and found the speedup is workload-sensitive. Code generation sped up materially, but prose barely broke even and JSON output got much slower when draft acceptance collapsed.
The takeaway is blunt: speculative decoding only pays when the draft model’s guesses are good enough, and structured-output workloads can wreck that math fast.
- Code prompts hit a 66% draft accept rate and saw a 1.53x throughput gain
- Long-form prose fell to a 31% accept rate and basically canceled out the overhead
- JSON output dropped to an 8% accept rate and ran about 2x slower with MTP enabled
- The benchmark suggests a rough tipping point around 50% acceptance on this hardware/workload mix
- This matters most for local developers, where every extra pass burns scarce Apple Silicon compute
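The break-even arithmetic behind these numbers can be sketched with the standard speculative-decoding cost model: the expected number of accepted tokens per verification step, divided by the relative cost of drafting plus verifying. The `k` (draft length) and `draft_cost` values below are illustrative assumptions, not the benchmark's actual settings, and the model ignores real-world overheads (memory pressure, kernel launch costs) that push the practical tipping point higher.

```python
def expected_speedup(alpha: float, k: int = 4, draft_cost: float = 0.1) -> float:
    """Idealized speculative-decoding speedup for draft accept rate `alpha`.

    alpha      -- per-token probability the target model accepts a draft token
    k          -- number of tokens drafted per verification step (assumed)
    draft_cost -- cost of one draft forward pass relative to the target (assumed)
    """
    # Expected accepted tokens per verify step: 1 + alpha + ... + alpha^k,
    # a geometric series = (1 - alpha^(k+1)) / (1 - alpha).
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Relative cost per step: k draft passes plus one target verification pass.
    cost_per_step = k * draft_cost + 1
    return expected_tokens / cost_per_step


for name, alpha in [("code", 0.66), ("prose", 0.31), ("json", 0.08)]:
    print(f"{name}: ~{expected_speedup(alpha):.2f}x")
```

With these assumed parameters the model reproduces the qualitative pattern from the benchmark: a clear win at 66% acceptance, roughly break-even at 31%, and a net loss at 8%, though the idealized math understates how badly the JSON case degrades in practice.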
DISCOVERED: 2026-05-09 (4h ago)
PUBLISHED: 2026-05-08 (6h ago)
AUTHOR: Hydroskeletal
