OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Gemma 4 speculative decoding hits 200 tok/s
A LocalLLaMA user reports that Gemma 4 31B, paired with Gemma 4 E2B as a draft model in llama.cpp, delivers roughly 130-200 tok/s on narrow, non-agentic tasks like extraction and classification. For Lithuanian-language workflows in the 2K-6K token range, they say it outperforms Gemini 2.5 Flash-Lite while running fully local on an RTX 5090.
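For context, a launch along these lines would reproduce the described setup. This is a minimal sketch, assuming a recent llama.cpp build with speculative-decoding support in llama-server; the GGUF filenames are hypothetical, since the post does not share its exact command, and flag spellings can vary between versions:

```python
import subprocess

# Hypothetical llama-server launch for the setup described above.
# Filenames are placeholders; -md, --draft-max, --draft-min, and -ngld
# match recent llama.cpp builds but may differ in older ones.
# subprocess.run blocks, i.e. the server runs in the foreground.
subprocess.run([
    "llama-server",
    "-m", "gemma-4-31b-q4_k_m.gguf",   # target model
    "-md", "gemma-4-e2b-q8_0.gguf",    # small draft model that proposes tokens cheaply
    "-ngl", "99",                      # offload all target-model layers to the GPU
    "-ngld", "99",                     # offload all draft-model layers to the GPU
    "--draft-max", "16",               # upper bound on tokens drafted per step
    "--draft-min", "4",                # lower bound on tokens drafted per step
    "-c", "8192",                      # context window covering the 2K-6K token jobs
    "--port", "8080",
], check=True)
```

The draft model proposes several tokens at a time and the 31B target verifies them in a single forward pass, so output quality matches the target model while predictable spans (labels, field names, boilerplate JSON) decode much faster.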
// ANALYSIS
This is a strong reminder that for structured, single-shot workloads, model architecture plus inference strategy can matter as much as raw benchmark scores. The caveat is that the gains are workload-specific, and the post is still an early sample rather than a scaled evaluation.
- The win here is not just Gemma 4 itself, but speculative decoding with a much smaller draft model cutting latency on easy tokens
- The use case is narrow and favorable: non-agentic, mostly extract/classify/rewrite jobs with short-to-mid contexts
- The reported 31.5 GB VRAM footprint makes this a serious local option for high-end consumer GPUs, but only barely
- The Lithuanian-language angle matters: open models with strong multilingual quality can be more valuable than cloud APIs when your target language is underrepresented
- The real test is failure rate and consistency at scale, especially for JSON-heavy outputs and longer production runs; a minimal way to check that is sketched after this list
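As a rough illustration of that last point, one could measure JSON validity over a batch of runs against the local server. A minimal sketch, assuming the llama-server OpenAI-compatible endpoint from the launch above; the prompt wording and field names are illustrative, not from the original post:

```python
import json

import requests

# Failure-rate check against the local llama-server instance launched above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"
PROMPT = ("Extract the fields name, date, and amount from the text below. "
          "Reply with a single JSON object only.")

def extract(doc: str) -> str:
    """Send one extraction request and return the raw model output."""
    resp = requests.post(
        SERVER_URL,
        json={
            "messages": [{"role": "user", "content": f"{PROMPT}\n\n{doc}"}],
            "temperature": 0,  # deterministic decoding suits structured extraction
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def is_valid_json(text: str) -> bool:
    """True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def json_failure_rate(docs: list[str]) -> float:
    """Fraction of documents whose extraction output is not valid JSON."""
    failures = sum(1 for doc in docs if not is_valid_json(extract(doc)))
    return failures / len(docs)
```

A parse-rate check like this is deliberately crude: it catches malformed JSON but not semantically wrong extractions, which would need a labeled sample to assess.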
// TAGS
gemma-4 · llama-cpp · speculative-decoding · benchmark · inference · self-hosted · open-weights · llm
DISCOVERED
2026-04-26 (6h ago)
PUBLISHED
2026-04-26 (9h ago)
RELEVANCE
8/10
AUTHOR
Clasyc