llama.cpp prompt cache falters on CPU
OPEN_SOURCE ↗
REDDIT · 6d ago · INFRASTRUCTURE


The post asks for a workaround to stop `llama.cpp` from re-processing the full prompt on every turn when running hybrid-attention models on CPU-only hardware. According to the author, Qwen3-VL preserved cache reuse cleanly, Qwen3.5 now appears fixed, and Gemma4 still triggers full prompt reprocessing on each turn.

// ANALYSIS

This looks less like user error and more like a genuine cache-handling limitation in `llama.cpp` for SWA or hybrid/recurrent-memory models on CPU. The confusing part is that the same backend can behave very differently depending on model architecture, so "cache works" and "cache fails" are both accurate claims depending on what's loaded.

  • `llama.cpp` is the right layer to blame here, since the log message explicitly points to cache data being unavailable, not just a small context window or a short reply.
  • The fact that Qwen3-VL behaves better while Qwen3.5/Gemma4 do not suggests model-specific support gaps, not a generic CPU performance issue.
  • Flags like `--swa-full` and `--flash-attn off` not changing behavior is another hint that this is about prompt/KV-cache bookkeeping, not attention kernel selection.
  • For users on CPU-only setups, the practical workaround may be “use a model arch that preserves prefix cache cleanly” rather than chasing backend flags.
  • This is valuable signal for `llama.cpp` maintainers because it helps separate model-architecture regressions from ordinary cache-window tuning bugs.
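The flags mentioned above can be combined into a single server invocation for testing. A minimal reproduction sketch follows: `--swa-full` and `--flash-attn` come from the post itself, while `--cache-reuse` is a separate `llama-server` option for prefix-cache reuse; the model filename is hypothetical, and exact flag availability depends on the `llama.cpp` build (check `llama-server --help`).

```shell
# Sketch of a CPU-only llama-server launch to probe prompt-cache behavior.
# Model path is a placeholder; substitute the GGUF file under test.
./llama-server \
  -m ./model-under-test.gguf \
  --swa-full \          # keep the full SWA window rather than the rolling cache
  --flash-attn off \    # rule out attention-kernel selection as a variable
  --cache-reuse 256     # permit prefix-cache reuse across similar prompts
```

Comparing server logs across model architectures with these flags held constant is what separates a cache-bookkeeping gap from ordinary tuning issues.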
// TAGS
llama-cpp · inference · open-source · self-hosted · cpu · cache · hybrid-attention · llm

DISCOVERED

6d ago

2026-04-05

PUBLISHED

6d ago

2026-04-05

RELEVANCE

8/10

AUTHOR

Quagmirable