OPEN_SOURCE
REDDIT // 18d ago · INFRASTRUCTURE
llama.cpp context shift stalls on RAG
A LocalLLaMA thread surfaces a KV-cache bottleneck: with a small n_predict budget, a large retrieved document forces full prompt reprocessing, while a larger generation budget lets context shift kick in. The takeaway: cache reuse lives inside the same token budget as the prompt and the completion.
// ANALYSIS
This looks like budget math, not model intuition. Context shift is a KV-cache reuse trick, so if the prompt plus the RAG payload leaves no safe runway for the completion, the runtime has to rebuild state or clamp generation.
- --n-predict / max output affects whether the server can reserve space for a shifted cache.
- llama.cpp docs call this "rotating context management" and expose --cache-reuse, so the behavior is an engine policy, not a magical model feature.
- Big retrieved chunks can destroy the suffix overlap needed for cache reuse.
- Sliding-window-attention models and backends can be brittle here or outright disable shifting.
- If latency matters, shrink retrieved chunks or raise --ctx-size instead of expecting the model to absorb everything.
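The budget arithmetic above can be sketched as a toy classifier. This is a hypothetical helper, not llama.cpp code: `plan_budget` and its return labels are made-up names, but the inequality it checks is the one the thread describes, where the context window must hold the prompt (including the RAG payload) and the completion together.

```python
# Toy sketch of llama.cpp-style token-budget math (hypothetical helper,
# not the engine's actual logic). n_ctx is the total token window that
# the prompt and the completion must share.

def plan_budget(n_ctx: int, prompt_tokens: int, n_predict: int) -> str:
    """Classify what the runtime has to do for one request."""
    if prompt_tokens >= n_ctx:
        # The prompt alone overflows the window: the server must
        # truncate or reject before generation even starts.
        return "prompt-overflow"
    if prompt_tokens + n_predict <= n_ctx:
        # Prompt plus completion fit: no shift needed, and a cached
        # prefix can be reused as-is.
        return "fits"
    # Generation will run past the window mid-stream: context shift
    # must evict early tokens, invalidating part of the KV cache.
    return "needs-shift"

# A 4096-token window with a 3584-token RAG-stuffed prompt leaves only
# 512 tokens of runway, so asking for 1024 new tokens forces a shift.
print(plan_budget(4096, 3584, 512))   # fits
print(plan_budget(4096, 3584, 1024))  # needs-shift
```

The same arithmetic explains the advice in the last bullet: shrinking retrieved chunks lowers `prompt_tokens`, and raising --ctx-size raises `n_ctx`; either one moves a request from the shift path back onto the fast path.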
// TAGS
llama-cpp · llm · rag · inference · open-source · self-hosted
DISCOVERED
2026-03-24
PUBLISHED
2026-03-24
RELEVANCE
8/10
AUTHOR
DigRealistic2977