llama.cpp slot restore still reprocesses prompts
OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE


A user reports that `llama-server` slot save/restore writes a large on-disk file, but restoring it for Qwen3.5-397B-A17B still falls back to full prompt reprocessing. The upstream docs describe this feature as saving the slot's prompt cache, yet the logs suggest true cache reuse is still gated or disabled for this model family.

// ANALYSIS

This looks less like broken persistence and more like two different mechanisms getting conflated: disk-backed slot state is being saved, but runtime prefix/KV reuse still has its own eligibility checks.

  • `--slot-save-path` enables `/slots/{id}?action=save|restore` and stores the slot's prompt cache to disk, not a generic “resume everything” checkpoint.
  • `n_written` is the serialized slot payload size, so a large file is expected for long contexts; it does not automatically mean the model will skip re-prefill on restore.
  • The `cache reuse is not supported` and `forcing full prompt re-processing due to lack of cache data` logs point to a separate reuse path being rejected at runtime.
  • Similar upstream Qwen3.5 reports show the same behavior, which makes this look like a model/runtime limitation or bug rather than a missing command-line flag.
  • `--swa-full` alone is not enough to guarantee resumable KV reuse for hybrid/SWA-style architectures.
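The save/restore endpoints from the first bullet can be exercised directly to reproduce the report. A minimal sketch, assuming a local `llama-server` launched with `--slot-save-path` on the default port 8080; the `filename` body field and the `/slots/{id}?action=save|restore` shape follow the upstream server docs, but the exact response fields may vary by version:

```python
import json
from urllib import request


def slot_action_url(base_url: str, slot_id: int, action: str) -> str:
    # Endpoint shape from the llama-server docs: POST /slots/{id}?action=save|restore.
    # These routes are only registered when the server runs with --slot-save-path.
    return f"{base_url}/slots/{slot_id}?action={action}"


def slot_action(base_url: str, slot_id: int, action: str, filename: str) -> dict:
    # Send the save or restore request; the JSON body names the on-disk file
    # (relative to the --slot-save-path directory).
    url = slot_action_url(base_url, slot_id, action)
    body = json.dumps({"filename": filename}).encode()
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example against a running server (hypothetical filename):
#   slot_action("http://localhost:8080", 0, "save", "qwen-slot0.bin")
#   slot_action("http://localhost:8080", 0, "restore", "qwen-slot0.bin")
```

Note that a successful restore response only confirms the serialized state was read back; whether the next request actually skips re-prefill still depends on the runtime reuse checks, so the server log lines quoted above are the thing to watch.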
// TAGS
llama-cpp · llama-server · inference · api · self-hosted · open-source

DISCOVERED

2026-04-24

PUBLISHED

2026-04-23

RELEVANCE

8/10

AUTHOR

chrisoutwright