Llama.cpp budget flags curb Qwen3.5 overthinking

// 117d ago TUTORIAL

An r/LocalLLaMA post shares a practical workaround for Qwen3.5's runaway "But wait..." reasoning loops: pass `--reasoning-budget` and `--reasoning-budget-message` to llama-server to hard-cap the number of thinking tokens and inject a termination phrase.

// ANALYSIS

Reasoning models that can't stop reasoning are a real UX problem, and this two-flag fix is the kind of pragmatic hack local-inference users actually need.

– `--reasoning-budget <N>` caps the thinking block at N tokens, preventing infinite refinement spirals
– `--reasoning-budget-message` appends a stop phrase that nudges the model to skip straight to the answer
– The author warns that very low budgets (<1024 tokens) degrade output quality: the cap has to stop the loop while still leaving the model enough room to actually think
– This is already merged into llama.cpp, and the same approach likely generalizes to other inference engines that expose similar budget-control flags
– Qwen3.5's MoE architecture (35B-A3B) makes it attractive for local deployment, so taming its thinking verbosity has real practical value
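
As a rough illustration of the two flags, the sketch below assembles a llama-server command line in Python. Only `--reasoning-budget` and `--reasoning-budget-message` come from the post. The model path, the stop phrase, the 2048-token budget, and the helper name `build_llama_server_cmd` are hypothetical placeholders. The 1024-token threshold mirrors the author's warning and is not something llama.cpp enforces.

```python
import shlex

# Threshold below which the post's author reports degraded output quality.
# This is a convention of this sketch, not a limit enforced by llama.cpp.
MIN_RECOMMENDED_BUDGET = 1024


def build_llama_server_cmd(model_path, reasoning_budget, budget_message):
    """Return (argv, warnings) for a llama-server launch with capped thinking.

    The helper only assembles arguments; it does not start the server.
    """
    warnings = []
    # Negative values are passed through unchanged; this sketch does not
    # interpret them. Only small non-negative budgets trigger the warning.
    if 0 <= reasoning_budget < MIN_RECOMMENDED_BUDGET:
        warnings.append(
            f"budget {reasoning_budget} is below {MIN_RECOMMENDED_BUDGET} tokens; "
            "the post warns that answers may get worse"
        )
    argv = [
        "llama-server",
        "-m", model_path,
        "--reasoning-budget", str(reasoning_budget),
        "--reasoning-budget-message", budget_message,
    ]
    return argv, warnings


if __name__ == "__main__":
    # Hypothetical model file and stop phrase, chosen only for illustration.
    argv, warnings = build_llama_server_cmd(
        "models/qwen3.5-35b-a3b.gguf",
        2048,
        "Time is up. I will now give the final answer.",
    )
    for note in warnings:
        print("warning:", note)
    # shlex.join quotes the message so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Returning the argument list rather than launching the process keeps the sketch easy to inspect. The list can be passed to `subprocess.run` once the budget and phrase have been tuned for the model at hand.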

// TAGS

llama.cpp, llm, inference, open-source, qwen3.5, self-hosted

DISCOVERED

117d ago

2026-03-16

PUBLISHED

117d ago

2026-03-16

RELEVANCE

6/10

AUTHOR

floconildo
