Tool Calls Break KV-Cache Reuse in llama.cpp
OPEN_SOURCE
REDDIT · 14h ago · INFRASTRUCTURE

This Reddit post describes repeated “checkpoint” loss during single-user chats in llama.cpp-backed frontends, specifically when function/tool calls are part of the conversation history. The poster says the issue shows up across Cherry Studio and Open WebUI with Qwen 3.5/3.6 models, even with ample context and cache RAM, and suspects that tool-call content or thinking traces are not being preserved between turns.

// ANALYSIS

Hot take: this sounds more like prompt reconstruction and chat-template mismatch than a raw context-capacity problem.

  • llama.cpp’s cache reuse depends on the exact prompt prefix staying stable; if a frontend rewrites or omits tool/tool_result turns, the reusable KV prefix is effectively broken.
  • Function-calling support in llama.cpp is real, but it is still sensitive to template format and client behavior, so “erased checkpoints” can be the symptom of serialization drift rather than memory exhaustion.
  • `preserve_thinking: true` only helps if the model/template path actually carries reasoning traces forward; it will not fix missing tool-call round-trips.
  • The most useful debug step is to compare the exact JSON sent on the next turn, especially whether assistant `tool_calls` and subsequent `tool` messages are being replayed verbatim.
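The prefix-stability point can be made concrete with a toy sketch. This is not llama.cpp code; real servers compare token sequences, and the whitespace split below is a stand-in for an actual tokenizer. The point is that once a frontend drops a tool round-trip, only the tokens before the divergence remain reusable:

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached prompt and the new one."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy "tokenization": whitespace split stands in for a real tokenizer.
cached = "user: hi assistant: <tool_call>get_time</tool_call> tool: 12:00".split()
# Frontend dropped the tool round-trip when rebuilding the prompt:
rebuilt = "user: hi assistant: tool: 12:00".split()

reusable = common_prefix_len(cached, rebuilt)
# Everything after the divergence must be re-evaluated from scratch,
# which a user experiences as a lost "checkpoint".
print(reusable, len(cached))  # 3 of 6 cached tokens remain reusable
```

Even a one-token difference early in the serialized history invalidates the entire remainder of the cache, which is why template or serialization drift looks so much like memory exhaustion from the outside.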
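The suggested debug step can be sketched as a small diff over the replayed message lists. The message shapes below follow the common OpenAI-style chat format; the helper name and the sample payloads are illustrative, not from the post:

```python
import json

def first_divergence(prev_msgs, next_msgs):
    """Index of the first message that differs between what was sent last
    turn and what the frontend replays this turn, or None if the previous
    transcript is an exact prefix of the new one."""
    for i, old in enumerate(prev_msgs):
        new = next_msgs[i] if i < len(next_msgs) else None
        if json.dumps(old, sort_keys=True) != json.dumps(new, sort_keys=True):
            return i
    return None

prev = [
    {"role": "user", "content": "what time is it?"},
    {"role": "assistant", "tool_calls": [
        {"id": "c1", "function": {"name": "get_time", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "c1", "content": "12:00"},
]
# Frontend collapsed the tool round-trip into plain text on the next turn:
nxt = [
    {"role": "user", "content": "what time is it?"},
    {"role": "assistant", "content": "12:00"},
    {"role": "user", "content": "thanks"},
]
print(first_divergence(prev, nxt))  # 1: the assistant tool_calls turn was rewritten
```

If this check fires anywhere before the newest user turn, the frontend is rewriting history and the KV-cache prefix cannot survive, regardless of server settings.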
// TAGS
llama-cpp · kv-cache · function-calling · tool-calling · open-webui · cherry-studio · qwen · context-management

DISCOVERED

14h ago (2026-04-17)

PUBLISHED

16h ago (2026-04-17)

RELEVANCE

5/10

AUTHOR

SimilarWarthog8393