OPEN_SOURCE ↗
REDDIT // TUTORIAL · 27d ago
Llama.cpp budget flags curb Qwen3.5 overthinking
A r/LocalLLaMA post shares a practical workaround for Qwen3.5's runaway "But wait..." reasoning loops: pass `--reasoning-budget` and `--reasoning-budget-message` to llama-server to hard-cap thinking tokens and inject a termination phrase.
// ANALYSIS
Reasoning models that can't stop reasoning are a real UX problem, and this two-flag fix is the kind of pragmatic hack local-inference users actually need.
- `--reasoning-budget <N>` caps the thinking block at N tokens, preventing infinite refinement spirals
- `--reasoning-budget-message` appends a stop phrase that nudges the model to skip straight to the answer
- The author warns that very low budgets (<1024 tokens) degrade output quality — there is a tradeoff between stopping the loop and giving the model enough space to actually think
- The flags are already merged into llama.cpp, and the approach likely generalizes to other inference engines that expose similar budget controls
- Qwen3.5's MoE architecture (35B-A3B) makes it attractive for local deployment, so taming its thinking verbosity has real practical value
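A launch command combining the two flags might look like the sketch below. The model filename, port, budget value, and stop-phrase wording are all illustrative assumptions — the post specifies only the flag names, not concrete values.

```shell
# Hypothetical llama-server invocation based on the flags described in the post.
# Model path, port, budget, and message text are placeholders, not from the source.
llama-server \
  -m ./qwen3.5-35b-a3b-q4_k_m.gguf \
  --port 8080 \
  --reasoning-budget 2048 \
  --reasoning-budget-message "Time is up. Stop thinking and answer now."
```

Per the author's warning, a budget well above 1024 tokens (2048 here) is the safer starting point; dropping lower stops the loop sooner but visibly degrades answer quality.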
// TAGS
llama.cpp · llm · inference · open-source · qwen3.5 · self-hosted
DISCOVERED
2026-03-16
PUBLISHED
2026-03-16
RELEVANCE
6/10
AUTHOR
floconildo