OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE
llama.cpp slots slash model-switch latency
An r/LocalLLaMA post shows how llama.cpp slot persistence plus a small Python supervisor can cut cold model-switch latency from minutes to seconds on a single RTX 3090 Ti. The trick is restoring cached KV state across swaps instead of re-prefilling 100k+ token contexts.
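For context, llama.cpp's server exposes per-slot save/restore endpoints that dump cached KV state to disk and reload it later. A minimal sketch of driving them from a Python supervisor, assuming llama-server was started with --slot-save-path; BASE_URL, SLOT_ID, and SLOT_FILE are illustrative names, not taken from the post:

```python
# Sketch: persist and restore one llama.cpp server slot's KV cache
# around a model swap, so the long context does not need re-prefill.
import requests

BASE_URL = "http://127.0.0.1:8080"   # llama-server address (assumed)
SLOT_ID = 0                          # slot holding the long-context session
SLOT_FILE = "agent_session.bin"      # written under --slot-save-path on the server

def save_slot() -> dict:
    """Dump the slot's cached KV state to disk before swapping models."""
    r = requests.post(
        f"{BASE_URL}/slots/{SLOT_ID}?action=save",
        json={"filename": SLOT_FILE},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()

def restore_slot() -> dict:
    """Reload the saved KV state instead of re-prefilling 100k+ tokens."""
    r = requests.post(
        f"{BASE_URL}/slots/{SLOT_ID}?action=restore",
        json={"filename": SLOT_FILE},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()
```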
// ANALYSIS
This is a cache-engineering win, not a model win: once slot state survives process swaps, long-context local agents stop paying the prefill tax every time they come back.
- The reported speedup comes from KV restore, so it matters most for swap-heavy, context-heavy workflows.
- The setup is brittle by design: model bytes, context size, quantization, and prompt stability all have to stay fixed or the cache becomes invalid.
- Hardlinked slot bins plus a supervisor that normalizes prompts turn llama.cpp into something closer to a session manager than a stateless server.
- The open PRs make this promising evidence, but not yet a polished upstream feature.
- opencode’s cache-stabilization flag is a practical dependency here, because a volatile system prompt would otherwise blow the prefix hash (see the normalization sketch after this list).
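A minimal sketch of that prompt-normalization idea, assuming a supervisor that rewrites the system prompt before it reaches the server; the VOLATILE_PATTERNS regexes and the placeholder text are illustrative, not taken from the post or from opencode:

```python
# Sketch: strip volatile fields (timestamps, session IDs) from the system
# prompt so its token prefix, and therefore the reusable KV cache, stays
# byte-identical across runs.
import re

VOLATILE_PATTERNS = [
    re.compile(r"Current date: \d{4}-\d{2}-\d{2}"),   # assumed volatile field
    re.compile(r"Session ID: [0-9a-f-]+"),            # assumed volatile field
]

def normalize_prompt(system_prompt: str) -> str:
    """Replace volatile spans with a fixed placeholder so the cached prefix
    still matches when the session comes back."""
    out = system_prompt
    for pattern in VOLATILE_PATTERNS:
        out = pattern.sub("<omitted>", out)
    return out
```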
// TAGS
llama-cpp · opencode · long-context · inference · local-first · self-hosted · cli · automation
DISCOVERED
3h ago
2026-05-06
PUBLISHED
5h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
yes_i_tried_google