Gemma 4 31B gets speculative boost
OPEN_SOURCE · REDDIT // BENCHMARK RESULT

A Reddit user reports that llama.cpp speculative decoding lifts Gemma 4 31B generation throughput on a 3090 from 32.9 t/s to 36.6 t/s when paired with Gemma 3 270m as the draft model. The gain is real but modest, with a roughly 44% draft acceptance rate.
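For context, a speculative-decoding run in llama.cpp is typically wired up with the flags sketched below. This is a hedged sketch, not the poster's exact command: the model filenames are placeholders, and flag names should be verified against `llama-server --help` on a recent build.

```shell
# Sketch of a llama.cpp speculative-decoding launch (hypothetical paths).
# -m            : target model (a Gemma 4 31B quant; filename is a placeholder)
# -md           : draft model (the small Gemma 3 270m GGUF named in the post)
# -ngl / -ngld  : offload all target / draft layers to the GPU (the 3090 here)
# --draft-max / --draft-min : bounds on how many tokens the draft proposes
llama-server \
  -m  gemma-4-31b-it-Q4_K_M.gguf \
  -md gemma-3-270m-it-Q8_0.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-min 4
```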

// ANALYSIS

This is the kind of benchmark that matters more than glossy model scores: a small local inference win that translates directly into better interactive latency. It is not a breakthrough, but it is a useful reminder that draft-model choice and backend tuning still matter a lot.

  • The post’s “Gemma 3 270B” line appears to be a typo; the command uses `gemma-3-270m-it-GGUF`.
  • Prompt throughput is basically unchanged, so the benefit is coming from decoding, not prefill.
  • A 0.44 acceptance rate is respectable, but it also shows the draft model is only partially aligned with the target.
  • The result looks CUDA-friendly; the author says they saw little improvement on MI50s, so backend and hardware matter.
  • For local users, this is a practical optimization path: a tiny draft model plus a recent llama.cpp build can beat brute-force scaling for responsiveness.
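The roughly 11% uplift is consistent with the usual speculative-decoding accounting. The sketch below is illustrative only: the throughput figures and the 0.44 acceptance rate come from the post, while the i.i.d. per-token acceptance model and the draft lengths `gamma` are assumptions (real acceptance is sequence-dependent, and draft-model cost is ignored).

```python
# Hedged illustration of how a 0.44 acceptance rate relates to throughput.
# Measured numbers (32.9 -> 36.6 t/s, alpha = 0.44) are from the Reddit post;
# gamma values below are assumptions, not reported settings.

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass, under the
    standard simplifying assumption of i.i.d. per-token acceptance alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

baseline, boosted = 32.9, 36.6
print(f"measured speedup: {boosted / baseline:.2f}x")  # ~1.11x

for gamma in (2, 4, 8):
    e = expected_tokens_per_step(0.44, gamma)
    print(f"gamma={gamma}: {e:.2f} tokens per verification pass")
```

Note how the expected tokens per pass saturate quickly as `gamma` grows at alpha = 0.44, which is why a modest, not dramatic, speedup is exactly what this acceptance rate predicts.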
// TAGS
gemma-4 · llama-cpp · llm · inference · benchmark

DISCOVERED

2026-04-04 (8d ago)

PUBLISHED

2026-04-04 (8d ago)

RELEVANCE

8 / 10

AUTHOR

Leopold_Boom