OPEN_SOURCE
REDDIT // 4h ago // INFRASTRUCTURE
llama.cpp memory climbs, leak fears rise
A Reddit user on a Strix Halo box watched RAM usage climb turn by turn while running Step-3.5-flash through LM Studio's llama.cpp Vulkan backend. Community replies point to host-memory prompt caching and context checkpoints as the likely cause, especially on unified-memory systems.
// ANALYSIS
This looks less like a classic leak and more like llama.cpp doing expensive caching in the least forgiving place possible: unified memory. On a 128GB UMA box, a feature that is harmless on discrete GPU systems can quietly eat the same pool needed by weights, KV cache, and context.
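To put rough numbers on that competition for the shared pool, here is a back-of-envelope sketch of what a single host-side prompt cache or context checkpoint can cost if it holds a copy of the KV state. The model dimensions are hypothetical placeholders, not Step-3.5-flash's actual architecture, and real cache formats may store more or less than the raw K/V tensors.

```python
# Back-of-envelope size of one cached prefix / context checkpoint (KV state).
# All model dimensions below are hypothetical placeholders.
n_layers   = 48       # transformer layers (assumption)
n_kv_heads = 8        # KV heads under GQA (assumption)
head_dim   = 128      # per-head dimension (assumption)
n_ctx      = 32_768   # tokens in the cached prefix
bytes_per  = 2        # fp16 per K/V element

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per  # K and V
print(f"~{kv_bytes / 2**30:.1f} GiB per cached context snapshot")    # ~6.0 GiB here
```

On a discrete-GPU workstation that copy sits in otherwise idle system RAM; on a 128GB unified-memory machine it is carved out of the same pool holding the weights and the live KV cache, so a handful of snapshots is enough to look alarming in `htop`.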
- Upstream llama.cpp discussions now treat `--cache-ram` and `--ctx-checkpoints` as intentional RAM consumers, not accidental allocations.
- On Strix Halo-style UMA, RAM and VRAM are the same pool, so cached prefixes directly compete with the model instead of living in a separate memory space.
- LM Studio abstracts away most of the knobs, which makes normal cache growth look like a leak to users watching `htop`.
- If disabling cache/checkpoints stops the climb, the behavior is probably expected; if it still grows, that points to a real backend or wrapper bug (see the monitoring sketch after this list).
- The practical lesson: long-context, turn-heavy workflows need explicit cache limits on unified-memory hardware or they will eventually self-DoS.
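A minimal way to run the check from the fourth bullet is to log the backend process's RSS across turns while toggling the caching knobs. The sketch below assumes `psutil` is installed and that you grab the llama.cpp/LM Studio backend PID from `htop`; the exact semantics of `--cache-ram` and `--ctx-checkpoints` vary between llama.cpp builds, so confirm them against `llama-server --help` for your version rather than taking specific values from here.

```python
# Hedged sketch: log the llama.cpp / LM Studio backend RSS over time to tell
# bounded cache growth (plateaus near its budget) from a genuine leak
# (keeps climbing with no ceiling). Requires: pip install psutil
import sys
import time

import psutil


def watch_rss(pid: int, interval_s: float = 15.0) -> None:
    proc = psutil.Process(pid)
    baseline = proc.memory_info().rss
    while True:
        rss = proc.memory_info().rss
        print(
            f"RSS {rss / 2**30:6.2f} GiB "
            f"(+{(rss - baseline) / 2**30:.2f} GiB since start)"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    # Usage: python watch_rss.py <pid-of-llama.cpp-backend>
    watch_rss(int(sys.argv[1]))
```

Run it once with caching and checkpoints at their defaults and once with them capped or disabled; if the curve flattens in the second run, the growth is expected cache behavior rather than a leak worth filing.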
// TAGS
llm · inference · gpu · local-first · open-source · llama-cpp
DISCOVERED
4h ago
2026-05-06
PUBLISHED
5h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
cafedude