KoboldCpp, llama.cpp slow on long chats
OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE


This Reddit thread covers a common local-inference pain point: once chats get long, KoboldCpp and llama.cpp can spend noticeable time re-evaluating the prompt history before the next reply starts. The short version: the KV cache helps, but only when the server can actually reuse the same token prefix or restore the session state cleanly.
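A toy sketch of why this happens (illustrative only, not llama.cpp's actual code, and the token IDs are made up): KV-cache reuse covers only the longest common token prefix between the cached prompt and the new one, so everything after the first divergence must be re-evaluated.

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs for two successive chat turns.
cached_prompt = [1, 15, 92, 7, 33, 4, 88]          # previous full prompt
new_prompt    = [1, 15, 92, 7, 33, 4, 88, 12, 51]  # same prefix + new turn

reused = common_prefix_len(cached_prompt, new_prompt)
to_eval = len(new_prompt) - reused
print(reused, to_eval)  # → 7 2 (only the new turn is evaluated)

# But if the template changes anything near the top (a timestamp, a
# rewritten system prompt), the shared prefix breaks almost immediately
# and nearly the whole history is reprocessed.
edited_prompt = [1, 99, 92, 7, 33, 4, 88, 12, 51]
print(common_prefix_len(cached_prompt, edited_prompt))  # → 1
```

This is why append-only chat histories stay fast while edited or re-templated ones trigger long prompt-processing stalls.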

// ANALYSIS

This is usually not a bug in the “model is thinking harder” sense; it’s the normal cost of prompt processing when cache reuse breaks or the active context is too large to stay hot.

  • KV cache is reused for unchanged prefixes, but chat turns often change enough that the server has to reprocess part or all of the prompt again
  • llama.cpp’s reuse is slot-based and prefix-sensitive, so long conversations, changing templates, or cache restoration failures can force full prompt eval
  • Models using sliding-window attention (SWA), such as the Gemma family, can be especially prone to this when the server mishandles full-size SWA caching, which recent llama.cpp work has been addressing
  • KoboldCpp has mitigations like `--smartcontext`, `--useswa`, and quantized KV cache, but those reduce memory or improve reuse rather than eliminating recomputation entirely
  • In practice, the latency spike you’re seeing is common once context gets large enough that the server can no longer cheaply preserve the exact working prefix
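The size of the spike described above follows from simple arithmetic: when prefix reuse fails, time-to-first-token is roughly the number of tokens to reprocess divided by the prompt-evaluation speed. A back-of-envelope sketch (the throughput and context numbers are hypothetical, not benchmarks):

```python
def ttft_seconds(tokens_to_reprocess, prompt_tok_per_s):
    """Rough time-to-first-token contribution from prompt processing."""
    return tokens_to_reprocess / prompt_tok_per_s

# Warm cache: only the latest user turn (~200 tokens) needs evaluating.
print(ttft_seconds(200, 500))   # → 0.4 seconds

# Broken cache: an 8k-token history is re-evaluated from scratch.
print(ttft_seconds(8000, 500))  # → 16.0 seconds
```

The same hardware produces both numbers; the difference is purely whether the cached prefix survived the latest turn.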
// TAGS
koboldcpp · llama.cpp · kv-cache · inference · llm · local-llm · self-hosted · open-source

DISCOVERED

3h ago

2026-04-28

PUBLISHED

6h ago

2026-04-28

RELEVANCE

8 / 10

AUTHOR

alex20_202020