OPEN_SOURCE
REDDIT // 14d ago · TUTORIAL
Qwen3.5 reasoning loops force llama.cpp tweaks
LocalLLaMA users are seeing Qwen3.5 Q4 slip into long reasoning loops under llama.cpp and asking which knobs actually help. The thread points toward explicit thinking-mode control plus the model’s recommended non-thinking sampler settings, not a mysterious hard failure.
// ANALYSIS
This looks mostly like a mode-control problem, not a broken model.
- The official Qwen3.5 recipe for direct-response mode is `enable_thinking=False` plus `temperature=0.7`, `top_p=0.8`, `top_k=20`, `presence_penalty=1.5`, and `repetition_penalty=1.0`.
- Qwen3.5 does not officially support the older `/think` and `/nothink` soft switch, so prompt hacks are less dependable than template-level control.
- If `enable_thinking` still leaks through in llama.cpp, the likely culprit is a template or server-version mismatch rather than sampler settings alone.
- For local deployments, the practical split is simple: force non-thinking for chat and keep thinking mode only for tasks that genuinely benefit from long deliberation.
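A minimal sketch of applying both fixes at once against llama.cpp's OpenAI-compatible server. Assumptions: the server was started with `--jinja` so the model's chat template is actually rendered, this build accepts `chat_template_kwargs` in the request body, and the model name and endpoint URL are placeholders for your setup.

```python
import json

# Request payload that forces non-thinking mode at the template level
# and applies the recommended non-thinking sampler settings.
payload = {
    "model": "qwen3.5",  # placeholder; use whatever name your server reports
    "messages": [
        {"role": "user", "content": "Summarize this log line in one sentence."}
    ],
    # Template-level switch: disable the thinking block entirely
    # (assumes a llama.cpp build that forwards chat_template_kwargs).
    "chat_template_kwargs": {"enable_thinking": False},
    # Recommended non-thinking sampler settings from the thread;
    # repetition_penalty stays at its default of 1.0, so it is omitted here.
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "presence_penalty": 1.5,
}

# To actually send it (requires a running llama-server at this address):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())

print(json.dumps(payload, indent=2))
```

If thinking tokens still appear with this request, that points at the template/version mismatch mentioned above rather than the samplers: check that the server logs show the Jinja template being used, not a fallback.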
// TAGS
qwen3-5 · llama-cpp · llm · reasoning · inference · self-hosted · open-weights
DISCOVERED
14d ago
2026-03-29
PUBLISHED
14d ago
2026-03-29
RELEVANCE
8/10
AUTHOR
XiRw