Qwen 3.5 template tweak stops needless prompt reprocessing
A LocalLLaMA guide shows that in non-thinking/instruct mode, Qwen 3.5 can trigger full prompt reprocessing because empty `<think>` tags get altered across turns. The proposed Jinja fix preserves empty think blocks while still stripping real reasoning content, reducing avoidable reparse behavior in llama.cpp sessions.
This is a practical community-level patch for a real latency pain point, even if it is not an upstream architectural fix.
- –The root cause is template-level prompt mutation, not just model size or hardware limits.
- –The workaround is narrowly scoped to thinking-disabled flows, so it avoids changing normal reasoning-mode behavior.
- –Keeping empty think tags is a low-context-cost tradeoff compared with repeatedly invalidating cache reuse.
- –The author reports good coding-session behavior but no long-context benchmarks yet, so production users should validate on their own workloads.
DISCOVERED
75d ago
2026-03-14
PUBLISHED
75d ago
2026-03-13
RELEVANCE
AUTHOR
guiopen