YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

llama.cpp throughput drops as context grows

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more. This page keeps the article you opened front and center while giving you a path into the live feed.

// WHAT AICRIER DOES

7+

TRACKED FEEDS

24/7

SCRAPED FEED

Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

llama.cpp throughput drops as context grows
OPEN LINK ↗
// 45d agoTUTORIAL

llama.cpp throughput drops as context grows

The post asks how to reduce the token-per-second drop that happens as chat history grows in a local llama.cpp setup running Vulkan across an MI50 and a V100. The practical answer is that this slowdown is expected: longer context means more KV-cache work and more attention cost per generated token. The most useful mitigations are to keep reusable prefixes cached, use KV-cache quantization when the backend supports it, tune batch sizes for prompt processing, and avoid letting the conversation balloon indefinitely.

// ANALYSIS

Hot take: this is mostly physics, not a missing magic flag. Once context gets large, generation slows because every new token has to attend over a bigger cache.

  • Try `--flash-attn` first; llama.cpp documents it as a backend toggle and KV-cache quantization generally depends on it.
  • If supported on your build/backend, test `-ctk q8_0 -ctv q8_0` or `q4_0` to shrink KV memory pressure; smaller caches usually help long-context runs more than raw weight quantization.
  • Increase `-b` and `-ub` for prompt processing if you have headroom, since llama.cpp notes larger batches can improve prefill throughput, especially with multiple GPUs.
  • Revisit multi-GPU placement with `-sm layer`, `-ts ...`, and `-mg ...`; bad splits can leave one device doing the slow path while the other sits underused.
  • Use `--prompt-cache` or `--prompt-cache-all` when you reuse a stable system prompt or long prefix, so you do not pay the full prefill cost every turn.
  • Keep `--ctx-size` no larger than you actually need; a bigger context window increases memory pressure and makes the slowdown more painful.
  • Best practice is summarization or context eviction, not endless chat growth; if the conversation is long-lived, compress earlier turns instead of restarting blindly.
// TAGS
llama-cppvulkankv-cachelong-contextlocal-llmperformance-tuningmi50v100

DISCOVERED

45d ago

2026-04-26

PUBLISHED

45d ago

2026-04-26

RELEVANCE

8/ 10

AUTHOR

WhatererBlah555