llama.cpp slot restore still reprocesses prompts
OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE


A user reports that `llama-server` slot save/restore writes a large on-disk file, but restoring it for Qwen3.5-397B-A17B still falls back to full prompt reprocessing. The upstream docs describe this feature as saving the slot's prompt cache, yet the logs suggest true cache reuse is still gated or disabled for this model family.

// ANALYSIS

This looks less like broken persistence and more like two different mechanisms getting conflated: disk-backed slot state is being saved, but runtime prefix/KV reuse still has its own eligibility checks.

  • `--slot-save-path` enables `/slots/{id}?action=save|restore` and stores the slot's prompt cache to disk, not a generic “resume everything” checkpoint.
  • `n_written` is the serialized slot payload size, so a large file is expected for long contexts; it does not automatically mean the model will skip re-prefill on restore.
  • The `cache reuse is not supported` and `forcing full prompt re-processing due to lack of cache data` logs point to a separate reuse path being rejected at runtime.
  • Similar upstream Qwen3.5 reports show the same behavior, which makes this look like a model/runtime limitation or bug rather than a missing command-line flag.
  • `--swa-full` alone is not enough to guarantee resumable KV reuse for hybrid/SWA-style architectures.
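The save/restore endpoints from the first bullet can be exercised directly to reproduce the report. A minimal sketch, assuming a local `llama-server` launched with `--slot-save-path` on the default port 8080; the `filename` body field and the `/slots/{id}?action=save|restore` shape follow the upstream server docs, but the exact response fields may vary by version:

```python
import json
from urllib import request


def slot_action_url(base_url: str, slot_id: int, action: str) -> str:
    # Endpoint shape from the llama-server docs: POST /slots/{id}?action=save|restore.
    # These routes are only registered when the server runs with --slot-save-path.
    return f"{base_url}/slots/{slot_id}?action={action}"


def slot_action(base_url: str, slot_id: int, action: str, filename: str) -> dict:
    # Send the save or restore request; the JSON body names the on-disk file
    # (relative to the --slot-save-path directory).
    url = slot_action_url(base_url, slot_id, action)
    body = json.dumps({"filename": filename}).encode()
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example against a running server (hypothetical filename):
#   slot_action("http://localhost:8080", 0, "save", "qwen-slot0.bin")
#   slot_action("http://localhost:8080", 0, "restore", "qwen-slot0.bin")
```

Note that a successful restore response only confirms the serialized state was read back; whether the next request actually skips re-prefill still depends on the runtime reuse checks, so the server log lines quoted above are the thing to watch.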
// TAGS
llama-cpp · llama-server · inference · api · self-hosted · open-source

DISCOVERED

2026-04-24

PUBLISHED

2026-04-23

RELEVANCE

8/10

AUTHOR

chrisoutwright