OPEN_SOURCE ↗
REDDIT // 6d ago · INFRASTRUCTURE
llama.cpp Gemma 4 burns RAM on prompts
Users running Gemma 4 31B in llama.cpp report that long conversations can push system RAM into OOM even when VRAM still has headroom. The thread suggests the bottleneck is KV-cache and prompt-processing memory growth, not a straightforward model-weight load issue.
// ANALYSIS
This looks like the classic long-context trap: the model fits, but the running conversation state does not. `-ngl` can keep weights on GPU, yet it does not make the separate context-memory budget disappear.
- `-c 102400` is huge, and even quantized KV-cache settings still scale sharply with prompt length
- Multiple users in the thread report the same RAM growth pattern, which makes this look systemic rather than machine-specific
- Related llama.cpp issues show KV-cache behavior can land in CPU RAM in CUDA setups, which matches the symptom users are seeing
- The practical lesson is that context length is now a deployment constraint, not just a quality knob
- If this is a regression, the useful bug report is the exact build, backend, and cache flags, since those likely determine where the memory goes
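The scaling argument above can be sketched with a back-of-envelope KV-cache estimator. The model dimensions below (layer count, KV heads, head size) are hypothetical placeholders, not Gemma's actual config; the real values live in the GGUF metadata, but the linear-in-context growth is the same either way:

```python
# Rough KV-cache size estimator for a transformer served by llama.cpp.
# NOTE: n_layers / n_kv_heads / head_dim are ILLUSTRATIVE assumptions,
# not the real Gemma 4 config -- read the actual values from the GGUF file.

def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: float) -> int:
    """Total bytes for K and V tensors across all layers at full context."""
    # Factor of 2: both keys and values are cached per token per layer.
    return int(2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt)

# Hypothetical 30B-class config at the thread's -c 102400:
fp16 = kv_cache_bytes(n_layers=48, n_ctx=102_400,
                      n_kv_heads=8, head_dim=128, bytes_per_elt=2)
q8 = kv_cache_bytes(n_layers=48, n_ctx=102_400,
                    n_kv_heads=8, head_dim=128, bytes_per_elt=1)

print(f"fp16 KV cache:  {fp16 / 2**30:.2f} GiB")  # ~18.75 GiB
print(f"~q8_0 KV cache: {q8 / 2**30:.2f} GiB")    # roughly half that
```

Even under these made-up dimensions, a 100K-token cache lands in the tens of gigabytes, and that budget is separate from the quantized weights, which is why `-ngl` alone cannot prevent it from spilling into system RAM.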
// TAGS
llama-cpp · llm · inference · gpu · open-source · self-hosted
DISCOVERED
6d ago
2026-04-06
PUBLISHED
6d ago
2026-04-06
RELEVANCE
8/10
AUTHOR
GregoryfromtheHood