OPEN_SOURCE ↗
REDDIT // 3h ago // INFRASTRUCTURE
KoboldCpp, llama.cpp slow on long chats
This Reddit thread is about a common local-inference pain point: once chats get long, KoboldCpp and llama.cpp can spend noticeable time re-evaluating prompt history before the next reply starts. The short version is that the KV cache helps, but only when the server can actually reuse the same prompt prefix or restore the session state cleanly.
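A minimal sketch of the client-side habit that keeps cache reuse working: build the prompt append-only so earlier turns stay byte-identical, and ask the server to reuse its cached prefix. The `/completion` endpoint and `cache_prompt` field are part of llama.cpp's HTTP server API; the host, port, system prompt, and chat formatting below are placeholder assumptions.

```python
# Sketch: append-only chat loop against a llama-server instance, assuming
# it is running on http://localhost:8080. Keeping the history byte-identical
# between turns lets the server's KV-cache prefix match cover everything
# except the newly added user message.
import requests

SERVER = "http://localhost:8080"
history = "System: You are a helpful assistant.\n"  # never edited after the fact

def chat(user_msg: str) -> str:
    global history
    history += f"User: {user_msg}\nAssistant:"
    resp = requests.post(f"{SERVER}/completion", json={
        "prompt": history,
        "n_predict": 256,
        "cache_prompt": True,   # ask the server to reuse the cached prefix
    })
    reply = resp.json()["content"]
    history += reply + "\n"
    return reply

if __name__ == "__main__":
    print(chat("Why does my long chat get slow?"))
    print(chat("And does cache_prompt help?"))
```

If anything upstream rewrites the history (a changed system message, a different chat template, truncation from the top), the prefix no longer matches and prompt processing starts again from the divergence point.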
// ANALYSIS
This is usually not a bug in the “model is thinking harder” sense; it’s the normal cost of prompt processing when cache reuse breaks or the active context is too large to stay hot.
- KV cache is reused for unchanged prefixes, but chat turns often change enough of the prompt that the server has to reprocess part or all of it again
- llama.cpp's reuse is slot-based and prefix-sensitive, so long conversations, changing chat templates, or failed cache restores can force a full prompt eval (see the prefix-matching sketch after this list)
- Gemma 4 / SWA-style models can be especially prone to this when the server mishandles full-size SWA caching, which recent llama.cpp work has been fixing
- KoboldCpp has mitigations like `--smartcontext`, `--useswa`, and quantized KV cache, but those reduce memory use or improve reuse rather than eliminating recomputation entirely
- In practice, the latency spike you're seeing is common once the context gets large enough that the server can no longer cheaply preserve the exact working prefix
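To make the prefix sensitivity concrete, here is an illustration of slot-style reuse: the server keeps the cache up to the first token that differs and recomputes everything after it. This mirrors the idea behind llama.cpp's per-slot cache matching, not its actual implementation; the token IDs are stand-ins for real tokenizer output.

```python
# Illustration only: how prefix-based KV-cache reuse decides what to recompute.
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached_tokens = [1, 15, 7, 7, 42, 9, 3, 8]        # tokens already in the KV cache
new_prompt    = [1, 15, 7, 7, 42, 9, 3, 8, 5, 6]  # same prefix plus a new user turn
edited_prompt = [1, 99, 7, 7, 42, 9, 3, 8, 5, 6]  # an early turn was edited or re-templated

for label, prompt in [("append-only", new_prompt), ("edited history", edited_prompt)]:
    keep = common_prefix_len(cached_tokens, prompt)
    print(f"{label}: reuse {keep} tokens, re-evaluate {len(prompt) - keep}")
# append-only: reuse 8 tokens, re-evaluate 2
# edited history: reuse 1 token, re-evaluate 9
```

Anything that perturbs the rendered prompt near the start (system prompt edits, template changes, top-of-context truncation) moves the divergence point to the beginning, so nearly the whole context gets reprocessed before the next reply.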
// TAGS
koboldcpp · llama.cpp · kv-cache · inference · llm · local-llm · self-hosted · open-source
DISCOVERED
3h ago
2026-04-28
PUBLISHED
6h ago
2026-04-28
RELEVANCE
8/10
AUTHOR
alex20_202020