Gemma 4 MTP speeds up llama.cpp by ~40%
A community llama.cpp fork and a GGUF drafter pack bring Gemma 4's multi-token prediction (MTP) into local inference. On a MacBook Pro M5 Max, generation speed jumps from 97 tokens/s to 138 tokens/s (138/97 ≈ 1.42×) on a Fibonacci prompt.
This is a real local-inference win, not just a flashy benchmark. The important part is that Gemma 4's MTP path now runs end to end through a working llama.cpp fork with quantized drafter weights.
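For context, MTP decoding follows the same draft-and-verify pattern as speculative decoding: cheap draft tokens (with MTP, from the model's own extra prediction heads rather than a separate small model) are checked by the main model in a batched pass. Below is a minimal greedy sketch in Python; the function names and toy model interface are hypothetical stand-ins for illustration, not the fork's actual API.

```python
from typing import Callable, List

# Toy interface: a "model" maps a token prefix to its greedy next token.
# This is an illustrative stand-in, not llama.cpp's API.
Model = Callable[[List[int]], int]

def draft_and_verify(target: Model, drafter: Model,
                     prompt: List[int], n_draft: int = 4,
                     n_new: int = 32) -> List[int]:
    """Greedy speculative decoding: the drafter proposes n_draft tokens,
    the target verifies them, and the first mismatch is replaced by the
    target's own token, so output matches plain target decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1) Draft: propose n_draft tokens cheaply.
        base = list(out)
        draft: List[int] = []
        for _ in range(n_draft):
            draft.append(drafter(base + draft))
        # 2) Verify: in a real runtime this is one batched forward
        #    pass over all drafted positions, which is the speedup.
        for i, t in enumerate(draft):
            expect = target(base + draft[:i])
            if expect != t:
                out.append(expect)  # correct the first mismatch, stop
                break
            out.append(t)
        else:
            # Every drafted token accepted: target emits a bonus token.
            out.append(target(base + draft))
    return out[: len(prompt) + n_new]

# Toy demo: both models emit a repeating cycle, so every draft is accepted.
cycle = [1, 2, 3]
target_fn: Model = lambda ctx: cycle[len(ctx) % 3]
drafter_fn: Model = lambda ctx: cycle[len(ctx) % 3]
print(draft_and_verify(target_fn, drafter_fn, [1], n_new=8))
```

With greedy verification the output is identical to running the target alone; the gain is wall-clock only, and it scales with how many drafted tokens the target accepts.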
- The measured gain is substantial: about 40% faster token generation on the cited setup and prompt
- The repo links suggest this is not stock llama.cpp; it depends on custom MTP support and a patched runtime
- Quantized assistant GGUFs lower the barrier for local testing, which matters more than the raw headline speed
- The result is still narrow: one prompt, one machine, one model family, so broader gains will depend on workload and draft acceptance (see the note after this list)
- If upstream support matures, this could become a practical default for local Gemma 4 serving rather than a niche optimization
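On that last point, the standard speculative-decoding analysis (Leviathan et al., 2023) gives a back-of-the-envelope way to reason about draft acceptance: the expected number of tokens committed per verification pass depends only on the per-token acceptance rate and the draft length. The helper below is an illustrative calculation, not code from the fork.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target verification pass:
    E = (1 - alpha**(gamma + 1)) / (1 - alpha)
    alpha: per-token probability a drafted token is accepted (< 1).
    gamma: number of tokens drafted per pass."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. a 75%-acceptance drafter proposing 4 tokens per pass:
print(round(expected_tokens_per_pass(0.75, 4), 2))  # 3.05
```

End-to-end speedup is lower than this ratio because drafting itself costs time and rejected tokens are wasted work, which is why a single Fibonacci prompt (highly predictable code) likely sits near the best case.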
Discovered: 2026-05-08
Published: 2026-05-08
Author: gladkos