OPEN_SOURCE ↗
REDDIT // 8d ago · BENCHMARK RESULT
Gemma 4 31B gets speculative boost
A Reddit user reports that llama.cpp speculative decoding lifts Gemma 4 31B generation throughput on an RTX 3090 from 32.9 t/s to 36.6 t/s when paired with Gemma 3 270M as the draft model: a real but modest gain of roughly 11%, at a draft acceptance rate of about 44%.
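For context, llama.cpp attaches a draft model via its `--model-draft` family of flags. A sketch of such an invocation follows; the model filenames and draft-length values here are illustrative assumptions, not the poster's exact command, and flag names reflect recent llama.cpp builds:

```shell
# Illustrative only: filenames and draft bounds are assumptions.
# -md attaches the draft model; --draft-max / --draft-min bound how many
# tokens are speculated per round; -ngl / -ngld offload target and draft
# layers to the GPU respectively.
llama-server \
  -m  gemma-4-31b-it-Q4_K_M.gguf \
  -md gemma-3-270m-it-Q8_0.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 8 --draft-min 1
```

As the bullets below note, the same flags on a different backend (e.g. the MI50s the author mentions) may yield little benefit.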
// ANALYSIS
This is the kind of benchmark that matters more than glossy model scores: a small local inference win that translates directly into better interactive latency. It is not a breakthrough, but it is a useful reminder that draft-model choice and backend tuning still matter a lot.
- The post’s “Gemma 3 270B” line appears to be a typo; the command uses `gemma-3-270m-it-GGUF`.
- Prompt throughput is basically unchanged, so the benefit is coming from decoding, not prefill.
- A 0.44 acceptance rate is respectable, but it also shows the draft model is only partially aligned with the target.
- The result looks CUDA-friendly; the author says they saw little improvement on MI50s, so backend and hardware matter.
- For local users, this is a practical optimization path: a tiny draft model plus a recent llama.cpp build can beat brute-force scaling for responsiveness.
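The 44% acceptance rate also bounds how much speedup is even available. A minimal sketch of the standard speculative-decoding cost model makes this concrete; only the 0.44 acceptance rate comes from the post, while the draft length and draft-to-target cost ratio are assumptions:

```python
def expected_tokens(p: float, k: int) -> float:
    """Expected tokens emitted per verification round, given per-token
    acceptance probability p and k drafted tokens, counting the one
    token the target model always contributes."""
    return (1 - p ** (k + 1)) / (1 - p)

def speedup(p: float, k: int, draft_cost_ratio: float) -> float:
    """Idealized throughput multiplier vs. plain decoding, assuming the
    batched verification pass costs about one target decode step."""
    round_cost = k * draft_cost_ratio + 1.0
    return expected_tokens(p, k) / round_cost

# Assumed: k=4 drafted tokens, draft step ~5% of a target step.
print(round(speedup(0.44, 4, 0.05), 2))
```

Even this optimistic model predicts well under 2x at p = 0.44, and the observed ~11% gain sits below the idealized figure, consistent with verification and scheduling overheads eating into the theoretical headroom.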
// TAGS
gemma-4 · llama-cpp · llm · inference · benchmark
DISCOVERED
8d ago
2026-04-04
PUBLISHED
8d ago
2026-04-04
RELEVANCE
8/10
AUTHOR
Leopold_Boom