llama.cpp prompt cache falters on CPU
OPEN_SOURCE ↗
REDDIT · 6d ago · INFRASTRUCTURE


The post asks for a workaround to stop `llama.cpp` from re-processing the full prompt on every turn when running hybrid-attention models on CPU-only hardware. According to the author, Qwen3-VL preserved cache reuse cleanly, Qwen3.5 now appears fixed, and Gemma4 still triggers full prompt reprocessing on each turn.

// ANALYSIS

This looks less like user error and more like a genuine cache-handling limitation in `llama.cpp` for SWA or hybrid/recurrent-memory models on CPU. The confusing part is that the same backend can behave very differently depending on model architecture, so "cache works" and "cache fails" are both accurate claims depending on what's loaded.

  • `llama.cpp` is the right layer to blame here, since the log message explicitly points to cache data being unavailable, not just a small context window or a short reply.
  • The fact that Qwen3-VL behaves better while Qwen3.5/Gemma4 do not suggests model-specific support gaps, not a generic CPU performance issue.
  • Flags like `--swa-full` and `--flash-attn off` not changing behavior is another hint that this is about prompt/KV-cache bookkeeping, not attention kernel selection.
  • For users on CPU-only setups, the practical workaround may be “use a model arch that preserves prefix cache cleanly” rather than chasing backend flags.
  • This is valuable signal for `llama.cpp` maintainers because it helps separate model-architecture regressions from ordinary cache-window tuning bugs.
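The flags mentioned above can be combined into a single server invocation for testing. A minimal reproduction sketch follows: `--swa-full` and `--flash-attn` come from the post itself, while `--cache-reuse` is a separate `llama-server` option for prefix-cache reuse; the model filename is hypothetical, and exact flag availability depends on the `llama.cpp` build (check `llama-server --help`).

```shell
# Sketch of a CPU-only llama-server launch to probe prompt-cache behavior.
# Model path is a placeholder; substitute the GGUF file under test.
./llama-server \
  -m ./model-under-test.gguf \
  --swa-full \          # keep the full SWA window rather than the rolling cache
  --flash-attn off \    # rule out attention-kernel selection as a variable
  --cache-reuse 256     # permit prefix-cache reuse across similar prompts
```

Comparing server logs across model architectures with these flags held constant is what separates a cache-bookkeeping gap from ordinary tuning issues.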
// TAGS
llama-cpp · inference · open-source · self-hosted · cpu · cache · hybrid-attention · llm

DISCOVERED

6d ago

2026-04-05

PUBLISHED

6d ago

2026-04-05

RELEVANCE

8/10

AUTHOR

Quagmirable