llama.cpp server slowdown sparks tuning debate

// 127d ago INFRASTRUCTURE

A LocalLLaMA thread highlights a sharp performance drop when switching from llama-cli to llama-server on the same Qwen3.5 35B GGUF model: throughput fell from roughly 100 tokens per second to about 10. The discussion points to server-specific defaults such as parallel request slots and context allocation, and the VRAM overflow they can cause, as the likely culprits rather than a broken build.

// ANALYSIS

This is less a bug report than a reminder that local inference serving and single-user CLI runs optimize for different workloads.

– Multiple commenters point to --parallel as the first thing to check, since llama-server can reserve memory for concurrent requests by default
– If extra context slots push the model beyond available VRAM, performance can collapse as layers spill into system RAM or swap
– The thread also surfaces a practical debugging pattern: compare verbose startup configs between llama-cli and llama-server instead of assuming matching flags mean matching behavior
– For AI developers running local OpenAI-compatible endpoints, this is a useful warning that serving defaults can trade raw throughput for concurrency and convenience
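
The memory pressure described in these points can be illustrated with some rough arithmetic. The Python below is a minimal sketch, not llama.cpp's actual allocator. It assumes a standard attention KV cache stored in f16 and that every parallel slot gets its own full context window (how llama-server really divides or multiplies context across slots depends on the version and flags in use), and it ignores compute buffers. The model dimensions, weight size, and VRAM budget are hypothetical placeholders, not figures for the model in the thread.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical attention dimensions; real values live in the GGUF metadata."""
    n_layers: int            # layers that keep a KV cache
    n_kv_heads: int          # key/value heads per layer
    head_dim: int            # dimension of each head
    bytes_per_elem: int = 2  # 2 bytes per element for an f16 cache


def kv_cache_bytes(shape: ModelShape, total_ctx_tokens: int) -> int:
    """Estimate KV cache size: one K and one V vector per layer per cached token."""
    per_token = (2 * shape.n_layers * shape.n_kv_heads
                 * shape.head_dim * shape.bytes_per_elem)
    return per_token * total_ctx_tokens


def fits_in_vram(weights_bytes: int, shape: ModelShape, ctx_per_slot: int,
                 n_slots: int, vram_bytes: int) -> tuple[bool, int]:
    """Return (fits, bytes_needed), assuming each slot holds a full context window."""
    total_ctx = ctx_per_slot * n_slots
    needed = weights_bytes + kv_cache_bytes(shape, total_ctx)
    return needed <= vram_bytes, needed


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    shape = ModelShape(n_layers=40, n_kv_heads=8, head_dim=128)
    for n_slots in (1, 4):
        ok, needed = fits_in_vram(
            weights_bytes=18 * GIB,
            shape=shape,
            ctx_per_slot=32768,
            n_slots=n_slots,
            vram_bytes=24 * GIB,
        )
        print(f"{n_slots} slot(s): {needed / GIB:.1f} GiB needed, fits={ok}")
```

With these placeholder numbers, one slot stays within a 24 GiB budget while four slots do not, which is the kind of overflow the commenters suspect when throughput collapses.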

// TAGS

llama-cpp, llm, inference, open-source, devtool

DISCOVERED

127d ago

2026-03-07

PUBLISHED

127d ago

2026-03-07

RELEVANCE

6/10

AUTHOR

Sumsesum
