OPEN_SOURCE
REDDIT // 3h ago // INFRASTRUCTURE

Qwen3.6-27B GGUF loops under llama-server

Users report Qwen3.6-27B GGUF can spiral into repetitive thinking/output when served through llama-server with `preserve_thinking` enabled. This looks less like a bad quantization and more like a decoding or chat-template interaction gone sideways.

// ANALYSIS

Hot take: the model is probably not “broken”; the serving stack is likely letting it dig its own hole.

  • Qwen3.6’s thinking mode is sensitive to template handling, and `preserve_thinking: true` can keep prior reasoning traces in play long enough to amplify repetition.
  • `repeat-penalty 1.0` is the neutral value, i.e. no repetition penalty at all, so there’s nothing resisting the cycle once the model starts repeating (see the first sketch below).
  • `presence-penalty 0.0` and an aggressive speculative setup (`ngram-mod`) may make the loop harder to escape once the token stream drifts.
  • A cleaner test is to disable thinking preservation, add a real repeat penalty, and compare against a simpler llama.cpp template before blaming the GGUF (see the A/B sketch below).
  • The broader lesson: Qwen3.6 behaves more like a reasoning model than a plain chat model, so “works on my config” is very template-sensitive.
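
For intuition on why `1.0` is neutral, here is a minimal Python sketch of the CTRL-style repeat penalty that llama.cpp’s samplers apply; the token IDs and logit values are illustrative, not taken from the report:

```python
# Sketch of llama.cpp-style repeat penalty: logits of recently seen
# tokens are divided by the penalty if positive, multiplied if negative.
# At penalty = 1.0 this is the identity, so a looping token keeps its
# full probability mass on every step.
def apply_repeat_penalty(logits: dict[int, float],
                         recent_tokens: set[int],
                         penalty: float) -> dict[int, float]:
    out = dict(logits)
    for tok in recent_tokens:
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = {42: 3.0, 7: -1.0}
print(apply_repeat_penalty(logits, {42, 7}, 1.0))   # unchanged: {42: 3.0, 7: -1.0}
print(apply_repeat_penalty(logits, {42, 7}, 1.15))  # dampened: {42: ~2.61, 7: -1.15}
```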
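
And a minimal A/B sketch against llama-server’s native `/completion` endpoint, assuming a server already running on the default port. The field names follow llama.cpp’s `/completion` JSON API; the prompt and penalty values are placeholders to swap for the actual looping case, and the `preserve_thinking` toggle is assumed to be flipped server-side between runs as described in the report:

```python
# A/B test: same prompt, two sampling configs, against a running
# llama-server. Compare outputs for looping before blaming the GGUF.
import requests

URL = "http://localhost:8080/completion"
PROMPT = "Explain the tradeoff between top-k and top-p sampling."

configs = {
    "reported (neutral penalties)": {"repeat_penalty": 1.0, "presence_penalty": 0.0},
    "baseline (real penalties)":    {"repeat_penalty": 1.1, "presence_penalty": 0.3},
}

for name, sampling in configs.items():
    body = {"prompt": PROMPT, "n_predict": 512, **sampling}
    text = requests.post(URL, json=body, timeout=300).json()["content"]
    print(f"--- {name} ---\n{text[:400]}\n")
```
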
// TAGS
qwen3.6-27b · llm · inference · open-source · self-hosted · reasoning

DISCOVERED

3h ago

2026-04-29

PUBLISHED

4h ago

2026-04-29

RELEVANCE

9/10

AUTHOR

Kirys79