REDDIT · 6d ago · INFRASTRUCTURE

llama.cpp Gemma 4 burns RAM on prompts

Users running Gemma 4 31B in llama.cpp report that long conversations can exhaust system RAM and trigger OOM kills even when VRAM still has headroom. The thread points to KV-cache and prompt-processing memory growth as the bottleneck, not a straightforward model-weight load issue.

// ANALYSIS

This looks like the classic long-context trap: the model weights fit, but the running conversation state does not. `-ngl` can keep the weights on the GPU, yet the KV cache that holds that state has its own budget, and it grows with every token in the context window.
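
For a sense of scale, a back-of-envelope KV-cache estimate helps. The shape numbers below (48 layers, 8 KV heads, head dim 128) are placeholders rather than Gemma 4's published architecture; the point is that the cache scales linearly with `-c`:

```sh
# Hypothetical model shape; only the linear scaling with context length matters.
# bytes = 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
echo "$((2 * 48 * 102400 * 8 * 128 * 2)) bytes"  # -c 102400, f16 cache: ~18.8 GiB
echo "$((2 * 48 * 102400 * 8 * 128 * 1)) bytes"  # same, roughly, with q8_0 cache: ~9.4 GiB
echo "$((2 * 48 * 16384 * 8 * 128 * 2)) bytes"   # -c 16384, f16 cache: ~3.0 GiB
```

Even with the cache quantized, a six-figure context keeps on the order of ten gigabytes in play, so the budget is dominated by `-c` no matter where the cache lands.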

  • `-c 102400` is huge, and even with a quantized KV cache the memory still scales linearly with context length (a mitigation sketch follows this list)
  • Multiple users in the thread report the same RAM growth pattern, which makes this look systemic rather than machine-specific
  • Related llama.cpp issues show the KV cache can end up in CPU RAM on CUDA builds, which matches the symptom users are seeing
  • The practical lesson is that context length is now a deployment constraint, not just a quality knob
  • If this is a regression, the useful bug report includes the exact build, backend, and cache flags, since those likely determine where the memory goes
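
A minimal mitigation sketch, assuming a recent llama.cpp build; the GGUF filename is a placeholder, and flag spellings vary across versions, so verify against `--help`:

```sh
# Placeholder model path. Flags shown:
#   -ngl 99         offload all layers to the GPU
#   -c 16384        a far smaller context than the reported -c 102400
#   -fa             flash attention; V-cache quantization has required it
#   --cache-type-*  quantize the KV cache to roughly half the f16 footprint
./llama-cli -m gemma-4-31b-Q4_K_M.gguf -ngl 99 -c 16384 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0

# For the bug report: exact build and backend, plus where memory actually sits.
./llama-cli --version
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

The trade is explicit: less context and a lossier cache in exchange for a predictable memory ceiling.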
// TAGS
llama-cpp · llm · inference · gpu · open-source · self-hosted

DISCOVERED: 6d ago (2026-04-06)

PUBLISHED: 6d ago (2026-04-06)

RELEVANCE: 8/10

AUTHOR: GregoryfromtheHood