Llama.cpp budget flags curb Qwen3.5 overthinking
OPEN_SOURCE
REDDIT · 27d ago · TUTORIAL


A r/LocalLLaMA post shares a practical workaround for Qwen3.5's runaway "But wait..." reasoning loops: pass `--reasoning-budget` and `--reasoning-budget-message` to llama-server to hard-cap thinking tokens and inject a termination phrase.

// ANALYSIS

Reasoning models that can't stop reasoning are a real UX problem, and this two-flag fix is the kind of pragmatic hack local-inference users actually need.

  • `--reasoning-budget <N>` caps the thinking block at N tokens, preventing infinite refinement spirals
  • `--reasoning-budget-message` appends a stop phrase that nudges the model to skip straight to the answer
  • The author warns that very low budgets (under ~1024 tokens) degrade output quality — there's a tradeoff between cutting the loop short and giving the model enough space to actually think
  • This is already merged into llama.cpp and likely generalizable to other inference engines with similar budget-control flags
  • Qwen3.5's MoE architecture (35B-A3B) makes it attractive for local deployment, so taming its thinking verbosity has real practical value
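The flags above combine into a single llama-server launch. A minimal sketch — the model path, port, and budget value are illustrative, and the flag names are taken from the post as described:

```shell
# Start llama-server with a hard cap on thinking tokens.
# Model path is hypothetical; point it at your local GGUF file.
llama-server \
  -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --reasoning-budget 2048 \
  --reasoning-budget-message "Okay, time to answer." \
  --port 8080
```

Per the post's warning, keep the budget at or above roughly 1024 tokens so the model retains enough room to reason before the termination phrase is injected.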
// TAGS
llama.cpp · llm · inference · open-source · qwen3.5 · self-hosted

DISCOVERED

2026-03-16 (27d ago)

PUBLISHED

2026-03-16 (27d ago)

RELEVANCE

6/10

AUTHOR

floconildo