OPEN_SOURCE ↗
REDDIT // 3h ago · BENCHMARK RESULT
Gemma 4 E4B crawls on RTX 5070 Ti
A Reddit user reports Gemma 4 E4B generating only about 5 tokens/sec under llama.cpp on an RTX 5070 Ti laptop GPU with 12GB VRAM, even though prompt processing is fast. The post is a troubleshooting plea for better local inference settings for Gemma 4 and, potentially, the larger Gemma 4 26B variant.
// ANALYSIS
This looks more like a decode-path bottleneck than a broken model: prompt ingestion is screaming, but generation slows to a crawl once KV-cache pressure and context size kick in.
- Google positions Gemma 4 E4B as an edge/laptop-friendly model, but 12GB VRAM leaves very little headroom once you push a 16K context and quantized caches.
- The split between 540 tok/s prompt eval and 5 tok/s generation strongly suggests the slowdown is happening in decode-time attention/KV handling, not raw model load.
- The user's own note that lowering `--ubatch-size` improved speed lines up with a memory-bandwidth tradeoff rather than a simple CPU/GPU underutilization problem.
- `--cache-type-k/v q4_0`, `--mlock`, and `--no-mmap` may help fit the workload, but they can also make the runtime more conservative and memory-bound.
- Gemma 4 E4B is a plausible local model for a gaming laptop; Gemma 4 26B is a very different ask and likely needs a much smaller context or a larger VRAM budget.
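The flags discussed above combine into a single llama.cpp launch line. A minimal sketch, assuming a hypothetical GGUF filename and illustrative values for context and batch size; the post does not give the exact command:

```shell
# Hypothetical llama-server invocation for Gemma 4 E4B on a 12GB GPU.
# The model filename and numeric values are illustrative, not from the post.
llama-server \
  -m gemma-4-e4b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --ubatch-size 256 \
  --flash-attn \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock \
  --no-mmap
# --ubatch-size: the poster reported gains from lowering this value.
# --cache-type-k/v q4_0: quantized KV cache trades memory bandwidth for VRAM headroom.
# --flash-attn: llama.cpp requires flash attention when the V cache is quantized.
```

Worth noting that the quantized caches and `--no-mmap` are fit-the-model measures, not speed measures; if generation is already bandwidth-bound, they can make decode slower, which matches the symptoms in the post.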
// TAGS
gemma-4 · llama.cpp · inference · gpu · benchmark · open-weights · multimodal
DISCOVERED
3h ago
2026-04-17
PUBLISHED
18h ago
2026-04-16
RELEVANCE
8/10
AUTHOR
Plastic-Parsley3094