llama.cpp slots slash model-switch latency
OPEN_SOURCE ↗
REDDIT // 3h ago // INFRASTRUCTURE


An r/LocalLLaMA post shows how llama.cpp slot persistence plus a small Python supervisor can cut cold model-switch latency from minutes to seconds on a single 3090 Ti. The trick is restoring cached KV state across swaps instead of re-prefilling 100k+ token contexts.
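The save/restore round-trip goes through llama-server's slot API (the server must be started with `--slot-save-path`; the port, slot id, and state filename below are illustrative, not from the post). A minimal sketch of building those requests:

```python
import json
import urllib.request

def slot_request(base_url: str, slot_id: int, action: str, filename: str):
    """Build a POST request against llama-server's slot state API.

    Endpoint shape follows llama.cpp's server docs: POST to
    /slots/{id}?action=save (or restore), with the target filename in the
    JSON body, resolved relative to the server's --slot-save-path.
    """
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Save slot 0 before swapping models out, restore it after swapping back:
save = slot_request("http://127.0.0.1:8080", 0, "save", "agent-session.bin")
restore = slot_request("http://127.0.0.1:8080", 0, "restore", "agent-session.bin")
# urllib.request.urlopen(save)  # issue against a running llama-server
```

A supervisor only needs to fire the save request before killing the server process and the restore request after the model is reloaded; the heavy lifting is the server serializing KV state to disk.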

// ANALYSIS

This is a cache-engineering win, not a model win: once slot state survives process swaps, long-context local agents stop paying the prefill tax every time they come back.

  • The reported speedup comes from KV restore, so it matters most for swap-heavy, context-heavy workflows.
  • The setup is brittle by design: model bytes, context size, quantization, and prompt stability all have to stay fixed or the cache becomes invalid.
  • Hardlinked slot bins plus a supervisor that normalizes prompts turn llama.cpp into something closer to a session manager than a stateless server.
  • The open PRs make this promising evidence, but not yet a polished upstream feature.
  • opencode’s cache-stabilization flag is a practical dependency here because volatile system prompts would otherwise blow the prefix hash.
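The invalidation conditions in the bullets above amount to a cache key: a saved slot state is reusable only while the model bytes, context size, and normalized prompt prefix all match. A hypothetical supervisor-side sketch (field names and the normalization rule are assumptions, not llama.cpp's own logic):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SlotKey:
    """Everything that must stay fixed for a saved slot state to be valid."""
    model_sha256: str   # hash of the GGUF bytes (also pins quantization)
    n_ctx: int          # context size the state was saved with
    prompt_prefix: str  # normalized stable prefix (e.g. system prompt)

    def digest(self) -> str:
        h = hashlib.sha256()
        for part in (self.model_sha256, str(self.n_ctx), self.prompt_prefix):
            h.update(part.encode())
            h.update(b"\0")  # field separator to avoid boundary collisions
        return h.hexdigest()

def normalize(prompt: str) -> str:
    """Drop volatile lines (here: a hypothetical timestamp line) so the
    prefix hash stays stable across sessions, mirroring what a
    cache-stabilization flag does on the client side."""
    return "\n".join(
        line for line in prompt.splitlines()
        if not line.startswith("Current time:")
    )
```

If the digest matches, the supervisor restores the saved slot; on any mismatch it discards the state and eats one full prefill, which is the brittleness the bullets describe.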
// TAGS
llama-cpp · opencode · long-context · inference · local-first · self-hosted · cli · automation

DISCOVERED

2026-05-06 (3h ago)

PUBLISHED

2026-05-06 (5h ago)

RELEVANCE

8/10

AUTHOR

yes_i_tried_google