Gemma 4 E4B crawls on RTX 5070 Ti
OPEN_SOURCE ↗
REDDIT · 3h ago · BENCHMARK RESULT

A Reddit user reports Gemma 4 E4B running through llama.cpp at only about 5 tokens/sec of generation on an RTX 5070 Ti laptop with 12GB VRAM, even though prompt processing runs fast (around 540 tok/s). The post is a troubleshooting plea for better local inference settings on Gemma 4 and, potentially, the larger Gemma 26B variants.

// ANALYSIS

This looks more like a decode-path bottleneck than a broken model: prompt ingestion is screaming, but generation slows to a crawl once KV-cache pressure and context size kick in.
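A back-of-envelope roofline check makes the fast-prompt/slow-decode split plausible. All numbers below are illustrative assumptions (active parameter count, bandwidth figures, KV size), not values from the post: decode reads the active weights plus the KV cache once per token, so its ceiling is set by memory bandwidth, and it collapses if layers spill out of VRAM over PCIe.

```python
# Rough decode-throughput bound for a quantized ~4B-active-param model.
# Every constant here is an illustrative assumption, not a measured value.

def decode_tok_per_s(active_params_b, bits_per_weight, kv_bytes, bandwidth_gbs):
    """Upper bound on tokens/sec when decode is memory-bandwidth-bound:
    each generated token streams the active weights plus the whole KV cache."""
    weight_bytes = active_params_b * 1e9 * bits_per_weight / 8
    bytes_per_token = weight_bytes + kv_bytes
    return bandwidth_gbs * 1e9 / bytes_per_token

# Assumed ~4B active params at 4-bit, plus an assumed ~2 GiB KV cache.
# Weights resident in VRAM (assumed ~448 GB/s laptop GPU bandwidth):
in_vram = decode_tok_per_s(4, 4, 2 * 1024**3, 448)

# Weights partially streamed from system RAM (assumed ~25 GB/s effective PCIe):
spilled = decode_tok_per_s(4, 4, 2 * 1024**3, 25)

print(f"{in_vram:.0f} tok/s in VRAM vs {spilled:.0f} tok/s when spilled")
```

On these assumed numbers, a VRAM-resident model has a triple-digit ceiling, while a PCIe-fed one lands in the single digits, right where the reported 5 tok/s sits.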

  • Google positions Gemma 4 E4B as an edge/laptop-friendly model, but 12GB VRAM leaves very little headroom once you push a 16K context and quantized caches.
  • The split between 540 tok/s prompt eval and 5 tok/s generation strongly suggests the slowdown is happening in decode-time attention/KV handling, not raw model load.
  • The user’s own note that lowering `--ubatch-size` improved speed lines up with a memory-bandwidth tradeoff rather than a simple CPU/GPU underutilization problem.
  • `--cache-type-k/v q4_0`, `--mlock`, and `--no-mmap` may help fit the workload, but they can also make the runtime more conservative and memory-bound.
  • Gemma 4 E4B is a plausible local model for a gaming laptop; Gemma 4 26B is a very different ask and likely needs much looser context or a larger VRAM budget.
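To see why a 16K context eats into a 12GB budget, here is a minimal KV-cache sizing sketch. The layer count, KV-head count, and head dimension are hypothetical stand-ins for a ~4B-class model, not published Gemma specs, and the q4_0 figure ignores the small per-block scale overhead:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits):
    """K and V caches: 2 tensors per layer, each ctx_len * n_kv_heads * head_dim."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bits // 8

CTX = 16384
# Assumed architecture: 32 layers, 8 KV heads of dim 128 (grouped-query attention).
f16 = kv_cache_bytes(32, 8, 128, CTX, 16)
q4 = kv_cache_bytes(32, 8, 128, CTX, 4)   # roughly what --cache-type-k/v q4_0 buys

print(f"f16 KV cache: {f16 / 1024**3:.1f} GiB, q4_0: {q4 / 1024**3:.1f} GiB")
```

On these assumed dimensions, quantizing the cache recovers about 1.5 GiB at 16K context: meaningful, but still fighting for headroom against the model weights, CUDA buffers, and the OS on a 12GB card.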
// TAGS
gemma-4 · llama.cpp · inference · gpu · benchmark · open-weights · multimodal

DISCOVERED

3h ago

2026-04-17

PUBLISHED

18h ago

2026-04-16

RELEVANCE

8/10

AUTHOR

Plastic-Parsley3094