llama.cpp throughput drops as context grows
The post asks how to reduce the token-per-second drop that happens as chat history grows in a local llama.cpp setup running Vulkan across an MI50 and a V100. The practical answer is that this slowdown is expected: longer context means more KV-cache work and more attention cost per generated token. The most useful mitigations are to keep reusable prefixes cached, use KV-cache quantization when the backend supports it, tune batch sizes for prompt processing, and avoid letting the conversation balloon indefinitely.
Hot take: this is mostly physics, not a missing magic flag. Once context gets large, generation slows because every new token has to attend over a bigger cache.
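To put rough numbers on that (purely illustrative: a 7B-class model with 32 layers, 32 KV heads, head dimension 128, and an fp16 cache is assumed; GQA models need less), the KV cache costs about

$$
\underbrace{2}_{K,\,V} \times 32 \times 32 \times 128 \times 2\ \text{bytes} \approx 0.5\ \text{MB per token},
$$

so an 8k-token conversation keeps roughly 4 GB of cache that every newly generated token must attend over; a q8_0 cache roughly halves that footprint.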
- Try `--flash-attn` first; llama.cpp documents it as a backend toggle, and KV-cache quantization generally depends on it (a combined command sketch follows this list).
- If supported on your build/backend, test `-ctk q8_0 -ctv q8_0` or `q4_0` to shrink KV memory pressure; a smaller cache usually helps long-context runs more than raw weight quantization.
- Increase `-b` and `-ub` for prompt processing if you have headroom, since llama.cpp notes larger batches can improve prefill throughput, especially with multiple GPUs.
- Revisit multi-GPU placement with `-sm layer`, `-ts ...`, and `-mg ...`; a bad split can leave one device doing the slow path while the other sits underused.
- Use `--prompt-cache` or `--prompt-cache-all` when you reuse a stable system prompt or long prefix, so you do not pay the full prefill cost every turn.
- Keep `--ctx-size` no larger than you actually need; a bigger context window increases memory pressure and makes the slowdown more painful.
- Best practice is summarization or context eviction, not endless chat growth; if the conversation is long-lived, compress earlier turns instead of restarting blindly.
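Putting the list together, here is a sketch of what an invocation might look like on this kind of dual-GPU Vulkan box. It is not a drop-in command: the model path, context size, and tensor-split ratio are placeholders, and exact flag spellings vary between llama.cpp versions, so verify everything against `./llama-cli --help` on your build.

```bash
# Sketch only: flag spellings and defaults differ across llama.cpp versions
# and backends (Vulkan here); model path, context size, and split ratio are
# placeholders.
./llama-cli -m ./models/your-model.gguf \
  --ctx-size 8192 \
  --flash-attn \
  -ctk q8_0 -ctv q8_0 \
  -b 2048 -ub 512 \
  -sm layer -ts 1,1 -mg 0 \
  -ngl 99 \
  --prompt-cache prefix-cache.bin --prompt-cache-all

# Assumptions to verify on your build:
#  --ctx-size      keep only as large as the conversation really needs
#  --flash-attn    some newer builds take an argument (e.g. "on"/"auto")
#  -ctk/-ctv       quantized KV cache; generally requires flash attention
#  -b/-ub          larger batches mainly speed up prefill, not generation
#  -sm/-ts/-mg     layer split across MI50 + V100; tune -ts so neither GPU idles
#  --prompt-cache* reuses prefill for a stable prefix; may not combine with
#                  interactive/conversation mode on every build
```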
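It is also worth measuring rather than guessing: `llama-bench` ships with llama.cpp and can compare settings at several prompt lengths. A sketch, assuming the comma-separated parameter lists and the `-pg` (prompt + generation) mode the tool documents; the model path and token counts are placeholders:

```bash
# Compare KV-cache types at several prompt lengths, plus one long-context
# prompt+generation run that approximates generating late in a long chat.
# Verify flag names with ./llama-bench --help on your build.
./llama-bench -m ./models/your-model.gguf \
  -fa 1 \
  -ctk f16,q8_0 -ctv f16,q8_0 \
  -p 512,4096,8192 \
  -pg 8192,128 \
  -r 3
```

If the q8_0 rows hold up at the 8192-token depth while f16 falls off, the KV-cache quantization is doing real work for this setup.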