Gemma 4 E4B crawls on RTX 5070 Ti

// 45d ago · BENCHMARK RESULT

A Reddit user reports Gemma 4 E4B generating only about 5 tokens/sec through llama.cpp on an RTX 5070 Ti laptop with 12GB VRAM, even though prompt processing is fast (about 540 tok/s). The post asks for better local inference settings for Gemma 4 E4B and, potentially, for the larger Gemma 4 26B.

// ANALYSIS

This looks more like a decode-path bottleneck than a broken model: prompt ingestion is fast, but generation drops to about 5 tok/s once a 16K context and KV-cache pressure come into play.
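
A rough way to see why generation can trail prompt processing by so much: at batch size 1, each generated token has to read roughly the whole set of active weights plus the KV cache from memory, so decode speed is capped by bandwidth divided by bytes read per token. The sketch below is a back-of-the-envelope estimate only. The byte count and both bandwidth figures are hypothetical placeholders, not measurements from the Reddit post and not RTX 5070 Ti specifications, and the function name is invented for illustration.

```python
def decode_ceiling_tok_s(bytes_read_per_token: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token streams these bytes once."""
    return bandwidth_bytes_per_s / bytes_read_per_token


GB = 1e9

# Hypothetical figures for illustration only (assumptions, not from the post).
bytes_per_token = 6 * GB    # assumed weights + KV cache touched per token
fast_memory_bw = 400 * GB   # assumed bandwidth when everything sits in VRAM
slow_link_bw = 25 * GB      # assumed bandwidth if data crosses a slower link

in_vram = decode_ceiling_tok_s(bytes_per_token, fast_memory_bw)
over_link = decode_ceiling_tok_s(bytes_per_token, slow_link_bw)

print(f"fast memory ceiling: ~{in_vram:.0f} tok/s")
print(f"slow link ceiling:   ~{over_link:.1f} tok/s")
```

With these placeholder numbers the ceiling falls from tens of tokens per second to single digits as soon as the per-token traffic moves to the slower path. That is one possible reading of the reported gap, not something the post confirms.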

  • Google positions Gemma 4 E4B as an edge/laptop-friendly model, but 12GB VRAM leaves very little headroom at a 16K context, even with a quantized KV cache.
  • The split between 540 tok/s prompt eval and 5 tok/s generation strongly suggests the slowdown is happening in decode-time attention/KV handling, not raw model load.
  • The user’s own note that lowering `--ubatch-size` improved speed lines up with a memory-bandwidth tradeoff rather than a simple CPU/GPU underutilization problem.
  • `--cache-type-k q4_0`, `--cache-type-v q4_0`, `--mlock`, and `--no-mmap` may help fit the workload, but they can also make the runtime more conservative and memory-bound.
  • Gemma 4 E4B is a plausible local model for a gaming laptop; Gemma 4 26B is a very different ask and likely needs a much shorter context or a larger VRAM budget.
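
To make the KV-cache point concrete, the sketch below estimates cache size for the three common llama.cpp cache types at a 16K context. It assumes full attention in every layer, so it overstates the cache for models that use sliding-window layers. The layer, head, and dimension values are hypothetical, NOT the published Gemma 4 E4B configuration, and the function name is invented for illustration. The bytes-per-element values follow the ggml block layouts, where `q8_0` and `q4_0` pack 32 elements plus one fp16 scale into 34 and 18 bytes respectively.

```python
# Bytes per cached element for three llama.cpp cache types.
# q8_0 and q4_0 store 32-element blocks with one fp16 scale per block.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """K plus V cache size, assuming full attention in every layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical dimensions (an assumption, not the real Gemma 4 E4B config).
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=16384)

for cache_type in BYTES_PER_ELEMENT:
    gib = kv_cache_bytes(cache_type=cache_type, **dims) / 2**30
    print(f"{cache_type:>5}: {gib:.4f} GiB")
```

Under these assumed dimensions, `q4_0` cuts the cache to a little over a quarter of its `f16` size, which is why cache quantization helps a 16K context fit in 12GB even though it does not remove the decode-time cost of reading the cache.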
// TAGS
gemma-4, llama.cpp, inference, gpu, benchmark, open-weights, multimodal

DISCOVERED

45d ago

2026-04-17

PUBLISHED

45d ago

2026-04-16

RELEVANCE

8/10

AUTHOR

Plastic-Parsley3094