OPEN_SOURCE ↗
REDDIT // 6d ago · INFRASTRUCTURE
llama.cpp Gemma 4 burns RAM on prompts
Users running Gemma 4 31B in llama.cpp report that long conversations can push system RAM into OOM even when VRAM still has headroom. The thread suggests the bottleneck is KV-cache and prompt-processing memory growth, not a straightforward model-weight load issue.
// ANALYSIS
This looks like the classic long-context trap: the model fits, but the running conversation state does not. `-ngl` can keep weights on GPU, yet it does not make the separate context-memory budget disappear.
- `-c 102400` is huge, and even quantized KV-cache settings still scale sharply with prompt length
- Multiple users in the thread report the same RAM growth pattern, which makes this look systemic rather than machine-specific
- Related llama.cpp issues show KV-cache behavior can land in CPU RAM in CUDA setups, which matches the symptom users are seeing
- The practical lesson is that context length is now a deployment constraint, not just a quality knob
- If this is a regression, the useful bug report is the exact build, backend, and cache flags, since those likely determine where the memory goes
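The scaling argument above can be sketched with a back-of-envelope KV-cache estimator. The model dimensions below (layer count, KV heads, head size) are hypothetical placeholders, not Gemma's actual config; the real values live in the GGUF metadata, but the linear-in-context growth is the same either way:

```python
# Rough KV-cache size estimator for a transformer served by llama.cpp.
# NOTE: n_layers / n_kv_heads / head_dim are ILLUSTRATIVE assumptions,
# not the real Gemma 4 config -- read the actual values from the GGUF file.

def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: float) -> int:
    """Total bytes for K and V tensors across all layers at full context."""
    # Factor of 2: both keys and values are cached per token per layer.
    return int(2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt)

# Hypothetical 30B-class config at the thread's -c 102400:
fp16 = kv_cache_bytes(n_layers=48, n_ctx=102_400,
                      n_kv_heads=8, head_dim=128, bytes_per_elt=2)
q8 = kv_cache_bytes(n_layers=48, n_ctx=102_400,
                    n_kv_heads=8, head_dim=128, bytes_per_elt=1)

print(f"fp16 KV cache:  {fp16 / 2**30:.2f} GiB")  # ~18.75 GiB
print(f"~q8_0 KV cache: {q8 / 2**30:.2f} GiB")    # roughly half that
```

Even under these made-up dimensions, a 100K-token cache lands in the tens of gigabytes, and that budget is separate from the quantized weights, which is why `-ngl` alone cannot prevent it from spilling into system RAM.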
// TAGS
llama-cpp · llm · inference · gpu · open-source · self-hosted
DISCOVERED
6d ago
2026-04-06
PUBLISHED
6d ago
2026-04-06
RELEVANCE
8/10
AUTHOR
GregoryfromtheHood