Qwen3.5 reasoning loops force llama.cpp tweaks
LocalLLaMA users are seeing Qwen3.5 Q4 slip into long reasoning loops under llama.cpp and asking which knobs actually help. The thread points toward explicit thinking-mode control plus the model’s recommended non-thinking sampler settings, not a mysterious hard failure.
This looks mostly like a mode-control problem, not a broken model.
- –The official Qwen3.5 recipe for direct-response mode is `enable_thinking=False` plus `temperature=0.7`, `top_p=0.8`, `top_k=20`, `presence_penalty=1.5`, and `repetition_penalty=1.0`.
- –Qwen3.5 does not officially support the older `/think` and `/nothink` soft switch, so prompt hacks are less dependable than template-level control.
- –If `enable_thinking` still leaks through in llama.cpp, the likely culprit is a template or server-version mismatch rather than sampler settings alone.
- –For local deployments, the practical split is simple: force non-thinking for chat and keep thinking mode only for tasks that genuinely benefit from long deliberation.
DISCOVERED
60d ago
2026-03-29
PUBLISHED
60d ago
2026-03-29
RELEVANCE
AUTHOR
XiRw