OPEN_SOURCE
REDDIT · 8d ago · PRODUCT UPDATE

llama.cpp fixes Gemma 4 VRAM bloat

Recent llama.cpp builds cut Gemma 4’s runaway KV-cache reservation, making the models far more practical to run locally. Users on Reddit report large context-length gains and a dramatic drop in VRAM usage with the same GGUF files — no redownload required.

// ANALYSIS

This is the kind of unglamorous runtime fix that determines whether a model feels usable or broken in practice.

  • The win is not about raw model quality; it’s about memory accounting, which is often the difference between “runs” and “OOMs”
  • Community reports suggest the fix landed in a recent llama.cpp update and may also be reflected in packaged apps like LM Studio, so the exact behavior depends on which backend build you’re on
  • The improvement matters most for local inference and agent workflows, where KV cache size quickly becomes the bottleneck at higher context lengths
  • It also underlines how much open-model UX depends on inference-stack maintenance, not just model releases
  • For developers, the practical takeaway is to update the runtime before blaming the model or resizing hardware
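To see why KV-cache accounting dominates at long contexts, a back-of-the-envelope estimate helps. The sketch below is a generic per-token KV-cache size formula (2 tensors × layers × KV heads × head dim × context × element size), not llama.cpp’s actual internal accounting, and the configuration values in the example are hypothetical — they are not Gemma 4’s real hyperparameters.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size in bytes for a transformer decoder.

    Each layer stores a K and a V tensor (hence the factor of 2),
    each holding n_kv_heads * head_dim values per cached token.
    bytes_per_elem defaults to 2 (fp16/bf16 cache).
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


# Hypothetical config: 48 layers, 8 KV heads (GQA), head_dim 128,
# 32K context, fp16 cache.
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=32_768)
print(f"{size / 2**30:.1f} GiB")  # → 6.0 GiB
```

Even this rough model shows the stakes: cache size grows linearly with context length, so an over-reservation bug at 32K+ contexts can easily waste gigabytes of VRAM.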
// TAGS
llama-cpp · open-source · inference · llm · gpu

DISCOVERED

8d ago

2026-04-04

PUBLISHED

8d ago

2026-04-04

RELEVANCE

9 / 10

AUTHOR

FusionCow