OPEN_SOURCE
REDDIT // 4h ago // INFRASTRUCTURE
llama.cpp memory climbs, leak fears rise
A Reddit user on a Strix Halo box watched RAM usage climb turn by turn while running Step-3.5-flash through LM Studio's llama.cpp Vulkan backend. Community replies point to host-memory prompt caching and context checkpoints as the likely cause, especially on unified-memory systems.
// ANALYSIS
This looks less like a classic leak and more like llama.cpp doing expensive caching in the least forgiving place possible: unified memory. On a 128GB UMA box, a feature that is harmless on discrete GPU systems can quietly eat the same pool needed by weights, KV cache, and context.
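To put rough numbers on that competition for the shared pool, here is a back-of-envelope sketch of what a single host-side prompt cache or context checkpoint can cost if it holds a copy of the KV state. The model dimensions are hypothetical placeholders, not Step-3.5-flash's actual architecture, and real cache formats may store more or less than the raw K/V tensors.

```python
# Back-of-envelope size of one cached prefix / context checkpoint (KV state).
# All model dimensions below are hypothetical placeholders.
n_layers   = 48       # transformer layers (assumption)
n_kv_heads = 8        # KV heads under GQA (assumption)
head_dim   = 128      # per-head dimension (assumption)
n_ctx      = 32_768   # tokens in the cached prefix
bytes_per  = 2        # fp16 per K/V element

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per  # K and V
print(f"~{kv_bytes / 2**30:.1f} GiB per cached context snapshot")    # ~6.0 GiB here
```

On a discrete-GPU workstation that copy sits in otherwise idle system RAM; on a 128GB unified-memory machine it is carved out of the same pool holding the weights and the live KV cache, so a handful of snapshots is enough to look alarming in `htop`.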
- Upstream llama.cpp discussions now treat `--cache-ram` and `--ctx-checkpoints` as intentional RAM consumers, not accidental allocations.
- On Strix Halo-style UMA, RAM and VRAM are the same pool, so cached prefixes directly compete with the model instead of living in a separate memory space.
- LM Studio abstracts away most of the knobs, which makes normal cache growth look like a leak to users watching `htop`.
- If disabling cache/checkpoints stops the climb, the behavior is probably expected; if it still grows, that points to a real backend or wrapper bug (see the monitoring sketch after this list).
- The practical lesson: long-context, turn-heavy workflows need explicit cache limits on unified-memory hardware or they will eventually self-DoS.
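A minimal way to run the check from the fourth bullet is to log the backend process's RSS across turns while toggling the caching knobs. The sketch below assumes `psutil` is installed and that you grab the llama.cpp/LM Studio backend PID from `htop`; the exact semantics of `--cache-ram` and `--ctx-checkpoints` vary between llama.cpp builds, so confirm them against `llama-server --help` for your version rather than taking specific values from here.

```python
# Hedged sketch: log the llama.cpp / LM Studio backend RSS over time to tell
# bounded cache growth (plateaus near its budget) from a genuine leak
# (keeps climbing with no ceiling). Requires: pip install psutil
import sys
import time

import psutil


def watch_rss(pid: int, interval_s: float = 15.0) -> None:
    proc = psutil.Process(pid)
    baseline = proc.memory_info().rss
    while True:
        rss = proc.memory_info().rss
        print(
            f"RSS {rss / 2**30:6.2f} GiB "
            f"(+{(rss - baseline) / 2**30:.2f} GiB since start)"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    # Usage: python watch_rss.py <pid-of-llama.cpp-backend>
    watch_rss(int(sys.argv[1]))
```

Run it once with caching and checkpoints at their defaults and once with them capped or disabled; if the curve flattens in the second run, the growth is expected cache behavior rather than a leak worth filing.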
// TAGS
llm · inference · gpu · local-first · open-source · llama-cpp
DISCOVERED
4h ago
2026-05-06
PUBLISHED
5h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
cafedude