Qwen3.5 reasoning loops force llama.cpp tweaks
OPEN_SOURCE ↗
REDDIT · 14d ago · TUTORIAL


LocalLLaMA users are seeing Qwen3.5 Q4 quants slip into long reasoning loops under llama.cpp and asking which knobs actually help. The thread points toward explicit thinking-mode control plus the model’s recommended non-thinking sampler settings, not a hard model failure.

// ANALYSIS

This looks mostly like a mode-control problem, not a broken model.

  • The official Qwen3.5 recipe for direct-response mode is `enable_thinking=False` plus `temperature=0.7`, `top_p=0.8`, `top_k=20`, `presence_penalty=1.5`, and `repetition_penalty=1.0`.
  • Qwen3.5 does not officially support the older `/think` and `/nothink` soft switch, so prompt hacks are less dependable than template-level control.
  • If `enable_thinking` still leaks through in llama.cpp, the likely culprit is a template or server-version mismatch rather than sampler settings alone.
  • For local deployments, the practical split is simple: force non-thinking for chat and keep thinking mode only for tasks that genuinely benefit from long deliberation.
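The recipe above can be sketched as an OpenAI-compatible request payload for a local llama.cpp server. A minimal sketch, with assumptions: the `chat_template_kwargs` passthrough requires a llama.cpp build with `--jinja` chat-template support, the endpoint URL is the default `llama-server` address, and llama.cpp spells repetition penalty as `repeat_penalty`:

```python
# Sketch: build a /v1/chat/completions payload that forces non-thinking
# mode and applies Qwen3.5's recommended direct-response sampler settings.
# Assumption: llama-server was started with --jinja so the chat template
# honors enable_thinking passed via chat_template_kwargs.

def non_thinking_payload(messages):
    return {
        "messages": messages,
        # Recommended non-thinking sampler settings from the model card
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "presence_penalty": 1.5,
        "repeat_penalty": 1.0,  # llama.cpp's name for repetition_penalty
        # Template-level switch: more dependable than /nothink prompt hacks
        "chat_template_kwargs": {"enable_thinking": False},
    }

payload = non_thinking_payload([{"role": "user", "content": "Hello"}])
# Send with e.g.:
# requests.post("http://localhost:8080/v1/chat/completions", json=payload)
```

For chat workloads this payload is the whole non-thinking split; for deliberation-heavy tasks, drop `chat_template_kwargs` (or set `enable_thinking` to `True`) and switch to the model's thinking-mode sampler settings instead.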
// TAGS
qwen3-5 · llama-cpp · llm · reasoning · inference · self-hosted · open-weights

DISCOVERED

14d ago

2026-03-29

PUBLISHED

14d ago

2026-03-29

RELEVANCE

8/10

AUTHOR

XiRw