Qwen3.6 users report reasoning loops
OPEN_SOURCE
REDDIT · 4h ago · INFRASTRUCTURE


A LocalLLaMA user reports that Unsloth's Q4_K_XL GGUF quant of Qwen3.6-35B-A3B runs slower than IQ4_XS on their 8GB VRAM setup and appears more prone to looping during reasoning. The thread is more troubleshooting signal than news, but it highlights the practical tradeoffs local users face when chasing quants with lower KL divergence (KLD) from the full-precision weights.
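Loop-prone reasoning is at least detectable client-side. A minimal sketch (a hypothetical `detect_loop` helper, not anything from the thread) that flags when the most recent n-gram keeps recurring in the generated token stream, so a frontend could cut generation early instead of burning tokens:

```python
def detect_loop(tokens, ngram=8, window=256, min_repeats=3):
    """Flag a likely loop: the most recent `ngram`-token sequence has
    already appeared `min_repeats`+ times in the trailing `window`."""
    recent = tokens[-window:]
    if len(recent) < ngram * min_repeats:
        return False
    tail = tuple(recent[-ngram:])
    # Count occurrences of the trailing n-gram across the window
    count = sum(
        1
        for i in range(len(recent) - ngram + 1)
        if tuple(recent[i:i + ngram]) == tail
    )
    return count >= min_repeats

# A stream that keeps cycling through the same phrase trips the check
stream = ["so", "the", "answer", "is", "let", "me", "reconsider"] * 6
print(detect_loop(stream))  # True
```

Thresholds are illustrative; real reasoning traces repeat legitimately (enumeration, restated constraints), so anything like this needs tuning before it gates generation.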

// ANALYSIS

This is the messy underside of open-weight inference: better quant metrics do not automatically mean better wall-clock behavior, especially with reasoning mode, MoE routing, huge context, CPU offload, and fork-specific llama.cpp behavior in the mix.
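The wall-clock cost compounds when reasoning runs unbounded. A back-of-envelope using the speeds reported in the thread (the trace length and the `seconds_for` helper are illustrative, not from the post):

```python
# Back-of-envelope: time to emit a reasoning trace at the thread's
# reported speeds. Trace length is a hypothetical example value.
def seconds_for(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

trace = 2000  # hypothetical thinking-trace length
print(f"IQ4_XS  (40 tok/s): {seconds_for(trace, 40):.0f} s")  # 50 s
print(f"Q4_K_XL (27 tok/s): {seconds_for(trace, 27):.0f} s")  # 74 s
```

A third slower per token is a third slower per loop iteration too, which is why a cycling model hurts more on the "better" quant.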

  • Qwen3.6-35B-A3B is a serious open MoE model, but local serving stability still depends heavily on sampler settings, template handling, backend version, and quant choice
  • The user's config keeps reasoning on with unlimited budget, making repeated internal reasoning especially expensive when the model starts cycling
  • Q4_K_XL may preserve quality better than smaller IQ quants, but the speed drop from 40 tok/s to 27 tok/s can erase that benefit for interactive use
  • Recent community chatter around Qwen3.6 points to backend quirks in speculative decoding, tool calls, and recurrent-state handling, so upgrading llama.cpp/TurboQuant builds may matter as much as sampler tweaks
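The sampler and budget levers the bullets describe map onto llama.cpp server flags. A command-line sketch, assuming a recent llama-server build; the model filename and every value here are placeholders for illustration, not the poster's config or recommended settings:

```shell
# Illustrative low-VRAM MoE setup: offload expert tensors to CPU,
# damp short loops with a mild repeat penalty, and optionally cap
# or disable thinking rather than leaving the budget unlimited.
llama-server -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --ctx-size 16384 \
  --temp 0.6 --top-p 0.95 --min-p 0.0 \
  --repeat-penalty 1.05 \
  --reasoning-budget 0
# -ot keeps MoE expert weights in system RAM so the rest fits in 8GB VRAM;
# --reasoning-budget 0 disables thinking entirely if loops dominate.
```

Since the thread suggests backend version matters as much as sampling, rebuilding from current llama.cpp before tuning any of these is the cheaper first step.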
// TAGS
qwen3.6-35b-a3b · llm · reasoning · inference · gpu · self-hosted · open-weights

DISCOVERED

4h ago

2026-04-23

PUBLISHED

5h ago

2026-04-23

RELEVANCE

6 / 10

AUTHOR

EggDroppedSoup