OPEN_SOURCE
REDDIT // 35d ago · INFRASTRUCTURE
llama.cpp server slowdown sparks tuning debate
A LocalLLaMA thread highlights a sharp performance drop when switching from llama-cli to llama-server on the same Qwen3.5 35B GGUF model, falling from roughly 100 tokens per second to about 10. The discussion points to server-specific defaults like parallel request slots, context allocation, and VRAM overflow as the likely culprits rather than a broken build.
// ANALYSIS
This is less a bug report than a reminder that local inference serving and single-user CLI runs optimize for different workloads.
- Multiple commenters point to --parallel as the first thing to check, since llama-server can reserve memory for concurrent request slots by default
- If extra context slots push the model beyond available VRAM, performance can collapse as layers spill into system RAM or swap
- The thread also surfaces a practical debugging pattern: compare verbose startup configs between llama-cli and llama-server instead of assuming matching flags mean matching behavior
- For AI developers running local OpenAI-compatible endpoints, this is a useful warning that serving defaults can trade raw throughput for concurrency and convenience
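The tuning advice above can be sketched as an explicit launch command. This is a minimal sketch, not a prescription from the thread: the flags (`-c`/`--ctx-size`, `-np`/`--parallel`, `-ngl`/`--n-gpu-layers`) are real llama.cpp options, but the model path and the specific values are placeholders. The key point is that llama-server divides the total context across parallel slots, so unused slots waste both context and the VRAM that backs it.

```shell
# Hedged example: pin llama-server settings explicitly instead of
# relying on server defaults. Model path and values are placeholders.
CTX=8192       # total context window, split across all slots
PARALLEL=1     # single-user box: don't reserve slots you won't use

# Each request slot gets CTX / PARALLEL tokens of context.
echo "per-slot context: $((CTX / PARALLEL))"

# Example launch (commented out; requires a llama.cpp build and a model):
# llama-server -m ./model.gguf -c "$CTX" -np "$PARALLEL" -ngl 99
```

Comparing the server's verbose startup log against llama-cli's, as the thread suggests, will show whether these values actually took effect.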
// TAGS
llama-cpp · llm · inference · open-source · devtool
DISCOVERED
2026-03-07
PUBLISHED
2026-03-07
RELEVANCE
6/10
AUTHOR
Sumsesum