llama.cpp context shift stalls on RAG

// 77d ago · INFRASTRUCTURE

A LocalLLaMA thread shows a KV-cache bottleneck: with a small n_predict budget, a large retrieved document forces full prompt reprocessing, but a larger generation budget lets context shift kick in. The takeaway is that cache reuse lives inside the same token budget as prompt plus completion.

// ANALYSIS

This looks like budget math, not model intuition. Context shift is a KV-cache reuse trick, so if the prompt plus RAG payload leave no safe runway for the completion, the runtime has to rebuild state or clamp generation.
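The budget arithmetic can be sketched in a few lines. This is a simplified model, not llama.cpp's actual decision logic: the `Budget` class and its field names are hypothetical, and the only assumption is that prompt, retrieved payload, and completion all draw from one context window.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Hypothetical token budget for one request (illustration only)."""
    n_ctx: int          # context window size, in tokens
    prompt_tokens: int  # system prompt + chat history + user query
    rag_tokens: int     # retrieved document payload
    n_predict: int      # requested completion budget

    def used(self) -> int:
        # Tokens consumed before generation starts.
        return self.prompt_tokens + self.rag_tokens

    def headroom(self) -> int:
        # Tokens left in the window for the completion.
        return self.n_ctx - self.used()

    def fits(self) -> bool:
        # True if the whole completion fits without shifting or truncation.
        return self.headroom() >= self.n_predict


small_doc = Budget(n_ctx=8192, prompt_tokens=1500, rag_tokens=6000, n_predict=512)
large_doc = Budget(n_ctx=8192, prompt_tokens=1500, rag_tokens=6500, n_predict=512)

print(small_doc.headroom(), small_doc.fits())  # 692 True
print(large_doc.headroom(), large_doc.fits())  # 192 False
```

In the second case the retrieved document alone pushes the request past the point where the completion fits, so the runtime has to do something: shift, truncate, or reprocess. Which one it picks is engine policy.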

– n_predict (the max output budget) affects whether the server can reserve space for a shifted cache.
– llama.cpp docs call this "rotating context management" and expose --cache-reuse, so the behavior is an engine policy, not a model feature.
– Big retrieved chunks can destroy the suffix overlap needed for cache reuse.
– Sliding-window-attention models and backends can be brittle here or disable shifting outright.
– If latency matters, shrink retrieved chunks or raise --ctx-size instead of expecting the model to absorb everything.
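To see why a swapped-in document hurts reuse, it can help to measure how much of a cached token sequence still lines up with the new prompt. The sketch below is an illustration under assumptions, not llama.cpp's matching code: the helper names and the toy token IDs are made up, and real engines apply their own thresholds and policies.

```python
def shared_prefix(cached, new):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def shared_suffix(cached, new, skip):
    """Number of identical trailing tokens, not counting the first `skip`
    tokens that were already matched as a prefix."""
    limit = min(len(cached), len(new)) - skip
    n = 0
    while n < limit and cached[-1 - n] == new[-1 - n]:
        n += 1
    return n


# Toy token IDs: system prompt, retrieved document, user question.
system = [1, 2, 3]
question = [7, 8, 9]
cached = system + [10, 11, 12, 13] + question          # previous request
new = system + [20, 21, 22, 23, 24, 25] + question     # different, longer document

prefix = shared_prefix(cached, new)
suffix = shared_suffix(cached, new, prefix)
print(prefix, suffix)  # 3 3
```

Only the three system-prompt tokens match in place. The three question tokens also match, but they now sit two positions later, so their cache entries are only usable if the engine can shift them; everything in between has to be recomputed. The larger the retrieved chunk, the larger that recomputed middle.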

// TAGS

llama-cpp · llm · rag · inference · open-source · self-hosted

DISCOVERED

77d ago

2026-03-24

PUBLISHED

77d ago

2026-03-24

RELEVANCE

8/10

AUTHOR

DigRealistic2977
