llama.cpp context shift stalls on RAG
A LocalLLaMA thread shows a KV-cache bottleneck: with a small n_predict budget, a large retrieved document forces full prompt reprocessing, but a larger generation budget lets context shift kick in. The takeaway is that cache reuse lives inside the same token budget as prompt plus completion.
This looks like budget math, not model intuition. Context shift is a KV-cache reuse trick, so if the prompt plus RAG payload leaves no safe runway for the completion, the runtime has to rebuild state or clamp generation.
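The budget arithmetic can be sketched in a few lines. This is a toy model, not llama.cpp's actual scheduling logic: the function name `completion_runway` and its fields are hypothetical, and it only assumes that the prompt, the retrieved payload, and the completion all draw on one context window of `n_ctx` tokens.

```python
def completion_runway(n_ctx: int, prompt_tokens: int, rag_tokens: int, n_predict: int) -> dict:
    """Toy budget check (hypothetical helper); every argument is a token count."""
    used = prompt_tokens + rag_tokens   # tokens consumed before generation starts
    free = n_ctx - used                 # what is left for the completion
    return {
        "free_tokens": free,
        "fits_without_shift": free >= n_predict,
        "shortfall": max(0, n_predict - free),  # tokens that must be evicted or clamped
    }


# A large retrieved document eats most of an 8192-token window.
big = completion_runway(n_ctx=8192, prompt_tokens=512, rag_tokens=7000, n_predict=1024)
# A smaller chunk leaves plenty of room for the same completion budget.
small = completion_runway(n_ctx=8192, prompt_tokens=512, rag_tokens=3000, n_predict=1024)
print(big)
print(small)
```

In this model the large payload leaves a shortfall, so something has to give: either cached tokens are evicted and shifted, or generation is cut short. How the real server resolves that case depends on its settings and version.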
- n_predict / max output affects whether the server can reserve space for a shifted cache.
- llama.cpp docs call this "rotating context management" and expose --cache-reuse, so the behavior is an engine policy, not a magical model feature.
- Big retrieved chunks can destroy the overlap with the previously cached prompt that cache reuse depends on.
- Sliding-window-attention models and backends can be brittle here or outright disable shifting.
- If latency matters, shrink retrieved chunks or raise ctx-size instead of expecting the model to absorb everything.
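To see why retrieved chunks hurt reuse, here is a minimal sketch that assumes the simplest policy: only the longest shared prefix between the cached prompt and the new prompt is reused. The token ids and the helper `common_prefix_len` are made up for illustration. llama.cpp's --cache-reuse goes further by trying to reuse shifted chunks, which this sketch does not model.

```python
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Count leading tokens shared by the cached prompt and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token ids for each prompt segment.
system = [1, 2, 3]
history = [5, 6, 7]
doc_a = [10, 11, 12, 13]   # chunk retrieved on the previous turn
doc_b = [20, 21, 22, 23]   # different chunk retrieved on this turn
new_turn = [8]

# Layout A: the retrieved chunk sits right after the system prompt.
reuse_a = common_prefix_len(system + doc_a + history,
                            system + doc_b + history + new_turn)

# Layout B: the retrieved chunk sits at the end, after the chat history.
reuse_b = common_prefix_len(system + history + doc_a,
                            system + history + new_turn + doc_b)

print(reuse_a, reuse_b)  # → 3 6
```

Under this assumption, a chunk that changes every turn invalidates everything placed after it, so putting it late in the prompt (or keeping it small) preserves more of the cache.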
Discovered: 2026-03-24
Published: 2026-03-24
Author: DigRealistic2977