OPEN_SOURCE
REDDIT // 3h ago // INFRASTRUCTURE
Qwen3.6-27B GGUF loops under llama-server
Users report Qwen3.6-27B GGUF can spiral into repetitive thinking/output when served through llama-server with `preserve_thinking` enabled. This looks less like a bad quantization and more like a decoding or chat-template interaction gone sideways.
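To turn "it loops" into something measurable before digging into configs, a small tail-repetition check can flag degenerate output automatically. This is a hypothetical sketch (function name and thresholds are illustrative, not from the report): it flags a token stream that ends in several back-to-back copies of the same n-gram, which is the signature of the spiral users describe.

```python
def is_looping(tokens, min_ngram=8, min_repeats=3):
    """Return True if the stream ends in `min_repeats` back-to-back
    copies of some n-gram of at least `min_ngram` tokens -- the
    signature of the repetitive spiral described in the report."""
    for n in range(min_ngram, len(tokens) // min_repeats + 1):
        tail = tokens[-n:]
        if all(
            tokens[len(tokens) - (i + 1) * n : len(tokens) - i * n] == tail
            for i in range(min_repeats)
        ):
            return True
    return False

healthy = list(range(40))                # no repetition
stuck = [1, 2, 3] + list(range(8)) * 3   # tail cycles three times

print(is_looping(healthy))  # False
print(is_looping(stuck))    # True
```

Running a check like this over transcripts from both the suspect config and a control config makes the comparison concrete instead of anecdotal.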
// ANALYSIS
Hot take: the model is probably not “broken”; the serving stack is likely letting it dig its own hole.
- Qwen3.6’s thinking mode is sensitive to template handling, and `preserve_thinking: true` can keep prior reasoning traces in context long enough to amplify repetition.
- `repeat-penalty 1.0` is exactly no repetition penalty (1.0 is the neutral value), so nothing pushes back once the model starts cycling.
- `presence-penalty 0.0` plus an aggressive speculative setup (`ngram-mod`) may make a loop harder to escape once the token stream drifts.
- A cleaner test: disable thinking preservation, add a real repeat penalty, and compare against a simpler llama.cpp template before blaming the GGUF.
- The broader lesson: Qwen3.6 behaves more like a reasoning model than a plain chat model, so “works on my config” is highly template-sensitive.
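The repeat-penalty point above is easy to see in isolation. Here is a minimal sketch of a multiplicative repetition penalty in the llama.cpp style (function name and toy logits are hypothetical): at 1.0 the transform is the identity, so repeated tokens are never discouraged.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty):
    """Multiplicative repetition penalty in the llama.cpp style:
    logits of recently seen tokens are divided by `penalty` when
    positive and multiplied by it when negative. penalty == 1.0
    is the identity transform, i.e. no penalty at all."""
    out = list(logits)
    for tok in set(recent_tokens):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

logits = [2.0, -1.0, 0.5]   # toy 3-token vocabulary
seen = [0, 1]               # tokens 0 and 1 already emitted

print(apply_repeat_penalty(logits, seen, 1.0))  # unchanged: [2.0, -1.0, 0.5]
print(apply_repeat_penalty(logits, seen, 1.1))  # tokens 0 and 1 pushed down
```

On the serving side the corresponding knob is llama.cpp's `--repeat-penalty` flag (also settable per request in recent llama-server builds); the "cleaner test" suggestion amounts to raising it from 1.0 to something like 1.1 before concluding the GGUF itself is at fault.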
// TAGS
qwen3.6-27b · llm · inference · open-source · self-hosted · reasoning
DISCOVERED
3h ago
2026-04-29
PUBLISHED
4h ago
2026-04-29
RELEVANCE
9/10
AUTHOR
Kirys79