llama.cpp throughput drops as context grows
OPEN_SOURCE
REDDIT · 8h ago · TUTORIAL


The post asks how to reduce the tokens-per-second drop that occurs as chat history grows in a local llama.cpp setup running Vulkan across an MI50 and a V100. The practical answer is that this slowdown is expected: a longer context means more KV-cache traffic and more attention work per generated token. The most useful mitigations are to keep reusable prefixes cached, enable KV-cache quantization where the backend supports it, tune batch sizes for prompt processing, and avoid letting the conversation balloon indefinitely.

// ANALYSIS

Hot take: this is mostly physics, not a missing magic flag. Once context gets large, generation slows because every new token has to attend over a bigger cache.
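The "physics" here can be sketched with a toy cost model. This is illustrative only: the layer, head, and dimension defaults below are hypothetical round numbers, not the specs of any particular model or of the MI50/V100 setup in the post. It shows that per-token attention work grows linearly with the size of the KV cache.

```python
# Toy per-token attention cost model for a decoder-only LLM.
# Defaults (32 layers, 32 heads, head_dim 128) are illustrative placeholders.
def attention_flops_per_token(context_len, n_layers=32, n_heads=32, head_dim=128):
    d_model = n_heads * head_dim
    # Each new token attends over the whole KV cache in every layer:
    # the QK^T scores and the weighted sum over V each cost ~2 ops per element.
    qk_scores = 2 * context_len * d_model
    weighted_v = 2 * context_len * d_model
    return n_layers * (qk_scores + weighted_v)

for ctx in (512, 4096, 32768):
    print(f"{ctx:>6} tokens of context -> {attention_flops_per_token(ctx):.2e} FLOPs/token")
```

Doubling the context doubles the attention FLOPs for every subsequent token, which is why the slowdown compounds as the chat grows; the mitigations below shrink or cap that cache rather than eliminate the scaling.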

  • Try `--flash-attn` first; llama.cpp documents it as a backend toggle and KV-cache quantization generally depends on it.
  • If supported on your build/backend, test `-ctk q8_0 -ctv q8_0` or `q4_0` to shrink KV memory pressure; smaller caches usually help long-context runs more than raw weight quantization.
  • Increase `-b` and `-ub` for prompt processing if you have headroom, since llama.cpp notes larger batches can improve prefill throughput, especially with multiple GPUs.
  • Revisit multi-GPU placement with `-sm layer`, `-ts ...`, and `-mg ...`; bad splits can leave one device doing the slow path while the other sits underused.
  • Use `--prompt-cache` or `--prompt-cache-all` when you reuse a stable system prompt or long prefix, so you do not pay the full prefill cost every turn.
  • Keep `--ctx-size` no larger than you actually need; a bigger context window increases memory pressure and makes the slowdown more painful.
  • Best practice is summarization or context eviction, not endless chat growth; if the conversation is long-lived, compress earlier turns instead of restarting blindly.
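The flags above can be combined into a single invocation. This is a sketch, not a tuned configuration: the model path, the `-ts 1,1` split ratio, the batch sizes, and the 8192-token window are placeholders to adapt to the actual model and VRAM headroom.

```shell
# Sketch: a long-context llama-cli run combining the mitigations above.
# Placeholders: ./model.gguf, -ts ratio, batch sizes, and context size.
./llama-cli -m ./model.gguf \
  --flash-attn \
  -ctk q8_0 -ctv q8_0 \
  -b 2048 -ub 512 \
  -sm layer -ts 1,1 -mg 0 \
  --prompt-cache prompt.bin --prompt-cache-all \
  --ctx-size 8192
```

Benchmark one change at a time (`llama-bench` is handy here) so a regression from, say, an unbalanced `-ts` split is not masked by a win from KV-cache quantization.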
// TAGS
llama-cpp · vulkan · kv-cache · long-context · local-llm · performance-tuning · mi50 · v100

DISCOVERED

8h ago

2026-04-26

PUBLISHED

11h ago

2026-04-26

RELEVANCE

8 / 10

AUTHOR

WhatererBlah555