llama.cpp server slowdown sparks tuning debate
OPEN_SOURCE
REDDIT · 35d ago · INFRASTRUCTURE

A LocalLLaMA thread highlights a sharp performance drop when switching from llama-cli to llama-server on the same Qwen3.5 35B GGUF model, falling from roughly 100 tokens per second to about 10. The discussion attributes the slowdown to server-specific defaults such as parallel request slots, context allocation, and VRAM overflow, rather than to a broken build.
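A minimal sketch of the comparison the thread describes. The model path and context size are placeholders, and flag behavior is as documented for recent llama.cpp builds; check `--help` on your build before relying on it:

```shell
# Single-user CLI run: one sequence owns the entire context window
./llama-cli -m model.gguf -ngl 99 -c 8192 -p "Hello" -n 128

# Server run: the same -c is shared across the -np parallel slots,
# so raising -np (or raising -c to compensate) grows the KV-cache
# reservation and can push layers out of VRAM
./llama-server -m model.gguf -ngl 99 -c 8192 -np 1 --port 8080
```

Comparing the verbose startup logs of both commands is the quickest way to spot a default that differs between the two binaries.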

// ANALYSIS

This is less a bug report than a reminder that local inference serving and single-user CLI runs optimize for different workloads.

  • Multiple commenters point to --parallel as the first thing to check, since llama-server can reserve memory for concurrent requests by default
  • If extra context slots push the model beyond available VRAM, performance can collapse as layers spill into system RAM or swap
  • The thread also surfaces a practical debugging pattern: compare verbose startup configs between llama-cli and llama-server instead of assuming matching flags mean matching behavior
  • For AI developers running local OpenAI-compatible endpoints, this is a useful warning that serving defaults can trade raw throughput for concurrency and convenience
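The VRAM-overflow point above is easy to sanity-check with back-of-the-envelope KV-cache arithmetic. The layer and head counts below are hypothetical (chosen for round numbers, not Qwen's actual config), and the formula assumes an f16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    # One K and one V cache per layer; f16 elements by default (2 bytes each)
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

# Hypothetical GQA model: 64 layers, 8 KV heads, head dim 128.
# Four slots at 8192 tokens each means 32768 tokens of cache in total.
per_slot_ctx = 8192
slots = 4
total = kv_cache_bytes(64, 8, 128, per_slot_ctx * slots)
print(f"{total / 2**30:.1f} GiB")  # 8.0 GiB of KV cache on top of the weights
```

Even a modest slot count multiplies the cache linearly, which is how a model that fits comfortably under llama-cli can spill into system RAM under llama-server.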
// TAGS
llama-cpp · llm · inference · open-source · devtool

DISCOVERED

2026-03-07 (35d ago)

PUBLISHED

2026-03-07 (35d ago)

RELEVANCE

6/10

AUTHOR

Sumsesum