llama.cpp context shift stalls on RAG

// 64d ago INFRASTRUCTURE

A LocalLLaMA thread shows a KV-cache bottleneck: with a small n_predict budget, a large retrieved document forces full prompt reprocessing, but a larger generation budget lets context shift kick in. The takeaway is that cache reuse lives inside the same token budget as prompt plus completion.
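
The shared-budget idea can be sketched with a few lines of arithmetic. This is only an illustration, not llama.cpp's actual logic: the function name `completion_headroom` and the token counts are hypothetical, and the sole assumption is that prompt tokens and generated tokens draw on one context window of `n_ctx` tokens.

```python
def completion_headroom(n_ctx: int, prompt_tokens: int, n_predict: int) -> int:
    """Tokens left over after the prompt and the requested completion.

    A negative result means prompt + completion do not fit in the window,
    so the runtime has to shift, truncate, or clamp generation.
    """
    return n_ctx - (prompt_tokens + n_predict)


# Hypothetical numbers: an 8192-token window, a 7000-token prompt with RAG payload.
print(completion_headroom(8192, 7000, 512))   # 680
print(completion_headroom(8192, 7000, 2048))  # -856
```

How a given server build reacts to a negative headroom is an engine policy and may differ between versions, so treat the sign of the result as a warning signal rather than a prediction of behavior.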

// ANALYSIS

This looks like budget math, not model intuition. Context shift is a KV-cache reuse trick, so if the prompt plus RAG payload leaves no safe runway for the completion, the runtime has to rebuild state or clamp generation.

  • n_predict / max output affects whether the server can reserve space for a shifted cache.
  • llama.cpp docs call this "rotating context management" and expose --cache-reuse, so the behavior is an engine policy, not a magical model feature.
  • Big retrieved chunks can destroy the token overlap with the cached prompt that cache reuse depends on.
  • Sliding-window-attention models and backends can be brittle here or outright disable shifting.
  • If latency matters, shrink retrieved chunks or raise --ctx-size instead of expecting the model to absorb everything.
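
One way to act on the "shrink retrieved chunks" advice is to cap the RAG payload before the request is sent. The sketch below is a hypothetical client-side helper, not part of llama.cpp: `select_chunks` and `rough_count` are illustrative names, and `rough_count` is a crude whitespace stand-in for a real tokenizer.

```python
from typing import Callable, List


def select_chunks(
    chunks: List[str],
    n_ctx: int,
    base_prompt_tokens: int,
    n_predict: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Keep retrieved chunks, in ranked order, until the RAG budget is spent."""
    # Room left for retrieved text once the base prompt and completion are reserved.
    budget = n_ctx - base_prompt_tokens - n_predict
    kept: List[str] = []
    for chunk in chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        budget -= cost
    return kept


def rough_count(text: str) -> int:
    # Assumption: one token per whitespace-separated word (real tokenizers differ).
    return len(text.split())
```

For example, with `n_ctx=20`, `base_prompt_tokens=5`, and `n_predict=9`, six tokens remain for retrieval, so `select_chunks(["a b c", "d e f g", "h i"], 20, 5, 9, rough_count)` keeps only the first chunk. Reserving `n_predict` up front is the design choice here: the completion budget is protected first and retrieval gets whatever is left.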
// TAGS
llama-cpp llm rag inference open-source self-hosted

DISCOVERED

64d ago

2026-03-24

PUBLISHED

64d ago

2026-03-24

RELEVANCE

8/10

AUTHOR

DigRealistic2977