OPEN_SOURCE ↗
REDDIT // TUTORIAL · 24d ago
llama.cpp spikes RAM at 131k context
A user on r/LocalLLaMA hit a 16GB KV cache allocation in `llama-server` after running with `n_ctx = 131072`, which got the process OOM-killed on a 16GB, CPU-only Linux Mint machine. The thread shows the usual trap: quantized weights may fit, but the KV cache can still blow past available RAM.
// ANALYSIS
This looks like a context-size footgun, not a broken GGUF. In llama.cpp, `-c`/`--ctx-size` directly drives KV cache allocation, so a 131k window can turn a small local setup into an OOM event.
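The arithmetic checks out. A minimal sketch of the KV cache estimate, assuming a Llama-3-8B-like geometry (32 layers, 8 KV heads, head dim 128) and the default f16 cache; these numbers are illustrative, and the real values should be read from the GGUF metadata:

```python
# Hypothetical KV cache size estimate; geometry values are assumptions
# (Llama-3-8B-like), not read from the user's actual model.
N_LAYER = 32        # transformer layers
N_HEAD_KV = 8       # KV heads (GQA)
HEAD_DIM = 128      # dimension per head
BYTES_PER_ELEM = 2  # f16 cache entries

def kv_cache_mib(n_ctx: int) -> float:
    # K and V each hold n_ctx * n_head_kv * head_dim elements per layer.
    total = 2 * N_LAYER * n_ctx * N_HEAD_KV * HEAD_DIM * BYTES_PER_ELEM
    return total / (1024 ** 2)

print(kv_cache_mib(131072))  # 16384.0 -- matches the reported buffer size
print(kv_cache_mib(8192))    # 1024.0 -- a far saner default for 16GB RAM
```

Under these assumed dimensions, a 131k context yields exactly the 16384 MiB `CPU KV buffer size` from the log, which supports the context-size diagnosis.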
- The log line `n_ctx = 131072` is the smoking gun, and the reported `CPU KV buffer size = 16384.00 MiB` matches that setting.
- Q4_K_M reduces model weight size, but it does not shrink KV cache memory by itself.
- `llama-server` is more sensitive than a one-shot CLI run because it reserves memory for serving multiple sequences and longer prompts.
- The most likely fix is to lower the context size or remove any lingering `-c 131072` from the launcher; llama.cpp docs and community explanations describe `--ctx-size` as the cache budget ([README](https://github.com/ggml-org/llama.cpp), [context-size discussion](https://github.com/ggerganov/llama.cpp/discussions/4130)).
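To pick a replacement for `-c 131072`, the estimate can be inverted: find the largest power-of-two context whose KV cache fits a chosen RAM budget. A sketch under the same assumed geometry (per-token cost and the helper name are illustrative):

```python
# Per-token KV cost for the assumed geometry: 2 (K+V) * 32 layers
# * 8 KV heads * 128 dims * 2 bytes (f16) = 128 KiB per token.
PER_TOKEN_BYTES = 2 * 32 * 8 * 128 * 2

def max_pow2_ctx(budget_gib: float) -> int:
    """Largest power-of-two context whose KV cache fits budget_gib."""
    budget_bytes = budget_gib * 1024 ** 3
    ctx = 512
    while 2 * ctx * PER_TOKEN_BYTES <= budget_bytes:
        ctx *= 2
    return ctx

print(max_pow2_ctx(4))  # 32768 -- a 4 GiB cache budget allows 32k context
```

With a value in hand, relaunching with something like `llama-server -m model.gguf -c 32768` keeps the cache inside the budget while leaving RAM for the quantized weights.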
// TAGS
llm · inference · open-source · self-hosted · devtool · llama-cpp
DISCOVERED
2026-03-19
PUBLISHED
2026-03-19
RELEVANCE
8/10
AUTHOR
Automatic_Finish8598