llama.cpp router mode hits tool parsing issues
llama.cpp has introduced a new --models-preset option and router mode for dynamic model management, but users report persistent parsing errors with agentic tools like Claude Code and Gemini CLI. While higher-level wrappers like Ollama handle these integrations seamlessly, the raw engine faces fragmentation issues with Jinja templates and GGUF metadata across different model families.
The shift to a native model router is a major leap for llama.cpp, but it exposes the fragile state of local LLM tool-calling compatibility. The parsing errors likely stem from llama-server returning tool-call arguments as JSON objects rather than the stringified JSON that OpenAI-compatible clients expect. Ollama's "just works" experience is powered by an internal normalization layer that abstracts away the complexity of divergent model templates. This fragmentation across model quants and engines imposes a compatibility tax on anyone using the raw engine, though the new LRU dynamic loading feature remains a significant optimization for multi-model pipelines on consumer VRAM.
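The object-vs-string mismatch can be worked around client-side. The sketch below is a hypothetical normalization shim (not part of llama.cpp or any client), assuming an OpenAI-style chat completion response where `function.arguments` may arrive as a parsed object instead of the JSON string the spec calls for:

```python
import json

def normalize_tool_calls(response: dict) -> dict:
    """Coerce tool-call arguments into stringified-JSON form in place.

    OpenAI-compatible clients expect `function.arguments` to be a JSON
    *string*; some servers emit an already-parsed object instead, which
    breaks strict parsers in agentic tools.
    """
    for choice in response.get("choices", []):
        tool_calls = choice.get("message", {}).get("tool_calls") or []
        for call in tool_calls:
            args = call.get("function", {}).get("arguments")
            if isinstance(args, (dict, list)):
                call["function"]["arguments"] = json.dumps(args)
    return response
```

Responses that already use the stringified form pass through unchanged, so the shim is safe to apply unconditionally in a proxy or client wrapper — roughly the kind of normalization layer Ollama performs internally.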
Discovered: 2026-04-11
Published: 2026-04-10
Author: chibop1