llama.cpp spikes RAM at 131k context
OPEN_SOURCE
REDDIT · 24d ago · TUTORIAL

A user on r/LocalLLaMA hit a 16GB KV cache allocation in `llama-server` after running with `n_ctx = 131072`, which got the process killed on a 16GB CPU-only Linux Mint machine. The thread illustrates a common trap: the quantized weights may fit in RAM, but the KV cache can still blow past what is left.

// ANALYSIS

This looks like a context-size footgun, not a broken GGUF. In llama.cpp, `-c`/`--ctx-size` directly drives KV cache allocation, so a 131k window can turn a small local setup into an OOM event.
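A minimal sketch of the fix at the command line (the model path is hypothetical; `-c`/`--ctx-size`, `--cache-type-k`/`--cache-type-v`, and `-fa` are llama.cpp flags, but verify them against your build's `--help` output):

```shell
# Cap the context window so the KV cache fits a 16GB machine
llama-server -m ./models/model-Q4_K_M.gguf -c 8192

# If a long context is genuinely needed, a quantized KV cache shrinks the buffer;
# quantizing the V cache typically requires flash attention (-fa)
llama-server -m ./models/model-Q4_K_M.gguf -c 32768 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Either route trades context length (or cache precision) for memory; the default is an fp16 cache sized for the full window at startup.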

  • The log line `n_ctx = 131072` is the smoking gun, and the reported `CPU KV buffer size = 16384.00 MiB` matches that setting.
  • Q4_K_M reduces model weight size, but it does not shrink KV cache memory by itself.
  • `llama-server` is more sensitive than a one-shot CLI run because it reserves memory for serving multiple sequences and longer prompts.
  • The most likely fix is to lower the context size or remove any lingering `-c 131072` from the launcher; llama.cpp docs and community explanations describe `--ctx-size` as the cache budget ([README](https://github.com/ggml-org/llama.cpp), [context-size discussion](https://github.com/ggerganov/llama.cpp/discussions/4130)).
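The reported numbers reconcile with back-of-envelope arithmetic. Assuming an fp16 cache and Llama-3-8B-like dimensions (32 layers, 8 KV heads under GQA, head dim 128 — assumptions for illustration, not confirmed by the thread), a quick estimator:

```python
def kv_cache_mib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV cache size in MiB: one K and one V vector
    per layer, per token, per KV head."""
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / (1024 ** 2)

# Hypothetical Llama-3-8B-like shape: 32 layers, 8 KV heads, head dim 128
print(kv_cache_mib(131072, 32, 8, 128))  # 16384.0 -> matches the reported buffer
print(kv_cache_mib(8192, 32, 8, 128))    # 1024.0 -> an 8k window fits easily
```

The linear dependence on `n_ctx` is the whole story: dropping from 131k to 8k cuts this model's cache from 16GiB to 1GiB with no change to the weights.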
// TAGS
llm · inference · open-source · self-hosted · devtool · llama-cpp

DISCOVERED

24d ago

2026-03-19

PUBLISHED

24d ago

2026-03-19

RELEVANCE

8/10

AUTHOR

Automatic_Finish8598