OPEN_SOURCE ↗
REDDIT // 3h ago · BENCHMARK RESULT
Gemma 4 E4B crawls on RTX 5070 Ti
A Reddit user reports Gemma 4 E4B generating only about 5 tokens/sec under llama.cpp on an RTX 5070 Ti laptop GPU with 12GB VRAM, even though prompt processing is fast. The post is a troubleshooting plea for better local inference settings for Gemma 4 and, potentially, the larger Gemma 4 26B variant.
// ANALYSIS
This looks more like a decode-path bottleneck than a broken model: prompt ingestion is screaming, but generation slows to a crawl once KV-cache pressure and context size kick in.
- Google positions Gemma 4 E4B as an edge/laptop-friendly model, but 12GB VRAM leaves very little headroom once you push a 16K context and quantized caches.
- The split between 540 tok/s prompt eval and 5 tok/s generation strongly suggests the slowdown is happening in decode-time attention/KV handling, not raw model load.
- The user's own note that lowering `--ubatch-size` improved speed lines up with a memory-bandwidth tradeoff rather than a simple CPU/GPU underutilization problem.
- `--cache-type-k/v q4_0`, `--mlock`, and `--no-mmap` may help fit the workload, but they can also make the runtime more conservative and memory-bound.
- Gemma 4 E4B is a plausible local model for a gaming laptop; Gemma 4 26B is a very different ask and likely needs a much smaller context or a larger VRAM budget.
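The flags discussed above combine into a single llama.cpp launch line. A minimal sketch, assuming a hypothetical GGUF filename and illustrative values for context and batch size; the post does not give the exact command:

```shell
# Hypothetical llama-server invocation for Gemma 4 E4B on a 12GB GPU.
# The model filename and numeric values are illustrative, not from the post.
llama-server \
  -m gemma-4-e4b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --ubatch-size 256 \
  --flash-attn \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --mlock \
  --no-mmap
# --ubatch-size: the poster reported gains from lowering this value.
# --cache-type-k/v q4_0: quantized KV cache trades memory bandwidth for VRAM headroom.
# --flash-attn: llama.cpp requires flash attention when the V cache is quantized.
```

Worth noting that the quantized caches and `--no-mmap` are fit-the-model measures, not speed measures; if generation is already bandwidth-bound, they can make decode slower, which matches the symptoms in the post.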
// TAGS
gemma-4 · llama.cpp · inference · gpu · benchmark · open-weights · multimodal
DISCOVERED
3h ago
2026-04-17
PUBLISHED
18h ago
2026-04-16
RELEVANCE
8/10
AUTHOR
Plastic-Parsley3094