llama.cpp context shift stalls on RAG
OPEN_SOURCE · INFRASTRUCTURE · REDDIT · 18d ago

A LocalLLaMA thread shows a KV-cache bottleneck: with a small n_predict budget, a large retrieved document forces full prompt reprocessing, but a larger generation budget lets context shift kick in. The takeaway is that cache reuse lives inside the same token budget as prompt plus completion.
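The budget arithmetic behind that takeaway can be sketched as a simple check. This is a minimal illustration with hypothetical numbers, not llama.cpp's actual accounting, which also covers template and special tokens:

```python
def fits_in_context(n_ctx: int, prompt_tokens: int, n_predict: int) -> bool:
    """Prompt and requested completion share one context window:
    cache reuse only helps if their sum stays inside n_ctx."""
    return prompt_tokens + n_predict <= n_ctx

# Hypothetical 8k window with a 7k RAG-stuffed prompt:
print(fits_in_context(8192, 7000, 512))   # True: small completion still fits
print(fits_in_context(8192, 7000, 2048))  # False: overflow forces shift or reprocess
```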

// ANALYSIS

This looks like budget math, not model intuition. Context shift is a KV-cache reuse trick, so if the prompt plus RAG payload leave no safe runway for the completion, the runtime has to rebuild state or clamp generation.

  • n_predict / max output affects whether the server can reserve space for a shifted cache.
  • llama.cpp docs call this "rotating context management" and expose --cache-reuse, so the behavior is an engine policy, not a magical model feature.
  • Big retrieved chunks can destroy the suffix overlap needed for cache reuse.
  • Sliding-window-attention models and backends can be brittle here or outright disable shifting.
  • If latency matters, shrink retrieved chunks or raise ctx-size instead of expecting the model to absorb everything.
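Following the last bullet, sizing retrieved chunks against the remaining budget can be sketched like this. The function name and the safety margin are assumptions for illustration, not part of llama.cpp's API:

```python
def max_chunk_tokens(n_ctx: int, fixed_prompt_tokens: int,
                     n_predict: int, margin: int = 64) -> int:
    """Largest retrieved-chunk budget that still leaves room for the
    completion, plus a small margin for chat-template/special tokens."""
    return max(0, n_ctx - fixed_prompt_tokens - n_predict - margin)

# Hypothetical: 8k window, 1k of system prompt + question, 1k completion
print(max_chunk_tokens(8192, 1024, 1024))  # 6080 tokens left for the chunk
```

If this comes out near zero, the fix per the thread is to shrink retrieval or raise ctx-size, not to hope the engine absorbs the overflow.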
// TAGS
llama-cpp · llm · rag · inference · open-source · self-hosted

DISCOVERED

2026-03-24 (18d ago)

PUBLISHED

2026-03-24 (18d ago)

RELEVANCE

8/10

AUTHOR

DigRealistic2977