OPEN_SOURCE
REDDIT · 3h ago · INFRASTRUCTURE
llama.cpp slots slash model-switch latency
An r/LocalLLaMA post shows how llama.cpp slot persistence plus a small Python supervisor can cut cold model-switch latency from minutes to seconds on a single RTX 3090 Ti. The trick is restoring cached KV state across swaps instead of re-prefilling 100k+ token contexts.
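For context, llama.cpp's server exposes per-slot save/restore endpoints that dump cached KV state to disk and reload it later. A minimal sketch of driving them from a Python supervisor, assuming llama-server was started with --slot-save-path; BASE_URL, SLOT_ID, and SLOT_FILE are illustrative names, not taken from the post:

```python
# Sketch: persist and restore one llama.cpp server slot's KV cache
# around a model swap, so the long context does not need re-prefill.
import requests

BASE_URL = "http://127.0.0.1:8080"   # llama-server address (assumed)
SLOT_ID = 0                          # slot holding the long-context session
SLOT_FILE = "agent_session.bin"      # written under --slot-save-path on the server

def save_slot() -> dict:
    """Dump the slot's cached KV state to disk before swapping models."""
    r = requests.post(
        f"{BASE_URL}/slots/{SLOT_ID}?action=save",
        json={"filename": SLOT_FILE},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()

def restore_slot() -> dict:
    """Reload the saved KV state instead of re-prefilling 100k+ tokens."""
    r = requests.post(
        f"{BASE_URL}/slots/{SLOT_ID}?action=restore",
        json={"filename": SLOT_FILE},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()
```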
// ANALYSIS
This is a cache-engineering win, not a model win: once slot state survives process swaps, long-context local agents stop paying the prefill tax every time they come back.
- The reported speedup comes from KV restore, so it matters most for swap-heavy, context-heavy workflows.
- The setup is brittle by design: model bytes, context size, quantization, and prompt stability all have to stay fixed or the cache becomes invalid.
- Hardlinked slot bins plus a supervisor that normalizes prompts turn llama.cpp into something closer to a session manager than a stateless server.
- The open PRs make this promising evidence, but not yet a polished upstream feature.
- opencode’s cache-stabilization flag is a practical dependency here, because a volatile system prompt would otherwise blow the prefix hash (see the normalization sketch after this list).
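A minimal sketch of that prompt-normalization idea, assuming a supervisor that rewrites the system prompt before it reaches the server; the VOLATILE_PATTERNS regexes and the placeholder text are illustrative, not taken from the post or from opencode:

```python
# Sketch: strip volatile fields (timestamps, session IDs) from the system
# prompt so its token prefix, and therefore the reusable KV cache, stays
# byte-identical across runs.
import re

VOLATILE_PATTERNS = [
    re.compile(r"Current date: \d{4}-\d{2}-\d{2}"),   # assumed volatile field
    re.compile(r"Session ID: [0-9a-f-]+"),            # assumed volatile field
]

def normalize_prompt(system_prompt: str) -> str:
    """Replace volatile spans with a fixed placeholder so the cached prefix
    still matches when the session comes back."""
    out = system_prompt
    for pattern in VOLATILE_PATTERNS:
        out = pattern.sub("<omitted>", out)
    return out
```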
// TAGS
llama-cpp · opencode · long-context · inference · local-first · self-hosted · cli · automation
DISCOVERED
3h ago
2026-05-06
PUBLISHED
5h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
yes_i_tried_google