OPEN_SOURCE
REDDIT // BENCHMARK RESULT
Gemma 4 speculative decoding hits 200 tok/s
A LocalLLaMA user reports that Gemma 4 31B, paired with Gemma 4 E2B as a draft model in llama.cpp, delivers roughly 130-200 tok/s on narrow, non-agentic tasks like extraction and classification. For Lithuanian-language workflows in the 2K-6K token range, they say it outperforms Gemini 2.5 Flash-Lite while running fully local on an RTX 5090.
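For context, a launch along these lines would reproduce the described setup. This is a minimal sketch, assuming a recent llama.cpp build with speculative-decoding support in llama-server; the GGUF filenames are hypothetical, since the post does not share its exact command, and flag spellings can vary between versions:

```python
import subprocess

# Hypothetical llama-server launch for the setup described above.
# Filenames are placeholders; -md, --draft-max, --draft-min, and -ngld
# match recent llama.cpp builds but may differ in older ones.
# subprocess.run blocks, i.e. the server runs in the foreground.
subprocess.run([
    "llama-server",
    "-m", "gemma-4-31b-q4_k_m.gguf",   # target model
    "-md", "gemma-4-e2b-q8_0.gguf",    # small draft model that proposes tokens cheaply
    "-ngl", "99",                      # offload all target-model layers to the GPU
    "-ngld", "99",                     # offload all draft-model layers to the GPU
    "--draft-max", "16",               # upper bound on tokens drafted per step
    "--draft-min", "4",                # lower bound on tokens drafted per step
    "-c", "8192",                      # context window covering the 2K-6K token jobs
    "--port", "8080",
], check=True)
```

The draft model proposes several tokens at a time and the 31B target verifies them in a single forward pass, so output quality matches the target model while predictable spans (labels, field names, boilerplate JSON) decode much faster.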
// ANALYSIS
This is a strong reminder that for structured, single-shot workloads, model architecture plus inference strategy can matter as much as raw benchmark scores. The caveat is that the gains are workload-specific, and the post is still an early sample rather than a scaled evaluation.
- The win here is not just Gemma 4 itself, but speculative decoding with a much smaller draft model cutting latency on easy tokens
- The use case is narrow and favorable: non-agentic, mostly extract/classify/rewrite jobs with short-to-mid contexts
- The reported 31.5 GB VRAM footprint makes this a serious local option for high-end consumer GPUs, but only barely
- The Lithuanian-language angle matters: open models with strong multilingual quality can be more valuable than cloud APIs when your target language is underrepresented
- The real test is failure rate and consistency at scale, especially for JSON-heavy outputs and longer production runs; a minimal way to check that is sketched after this list
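As a rough illustration of that last point, one could measure JSON validity over a batch of runs against the local server. A minimal sketch, assuming the llama-server OpenAI-compatible endpoint from the launch above; the prompt wording and field names are illustrative, not from the original post:

```python
import json

import requests

# Failure-rate check against the local llama-server instance launched above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"
PROMPT = ("Extract the fields name, date, and amount from the text below. "
          "Reply with a single JSON object only.")

def extract(doc: str) -> str:
    """Send one extraction request and return the raw model output."""
    resp = requests.post(
        SERVER_URL,
        json={
            "messages": [{"role": "user", "content": f"{PROMPT}\n\n{doc}"}],
            "temperature": 0,  # deterministic decoding suits structured extraction
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def is_valid_json(text: str) -> bool:
    """True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def json_failure_rate(docs: list[str]) -> float:
    """Fraction of documents whose extraction output is not valid JSON."""
    failures = sum(1 for doc in docs if not is_valid_json(extract(doc)))
    return failures / len(docs)
```

A parse-rate check like this is deliberately crude: it catches malformed JSON but not semantically wrong extractions, which would need a labeled sample to assess.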
// TAGS
gemma-4 · llama-cpp · speculative-decoding · benchmark · inference · self-hosted · open-weights · llm
DISCOVERED
2026-04-26 (6h ago)
PUBLISHED
2026-04-26 (9h ago)
RELEVANCE
8/10
AUTHOR
Clasyc