OPEN_SOURCE ↗
REDDIT // 4h ago // INFRASTRUCTURE
Qwen3.6 users report reasoning loops
A LocalLLaMA user says Unsloth's Q4_K_XL GGUF quant of Qwen3.6-35B-A3B is slower than IQ4_XS on their 8GB VRAM setup and appears more prone to looping during reasoning. The thread is more troubleshooting signal than news, but it highlights the practical tradeoffs local users face when chasing lower KLD quants.
// ANALYSIS
This is the messy underside of open-weight inference: better quant metrics do not automatically mean better wall-clock behavior, especially with reasoning mode, MoE routing, huge context, CPU offload, and fork-specific llama.cpp behavior in the mix.
- Qwen3.6-35B-A3B is a serious open MoE model, but local serving stability still depends heavily on sampler settings, template handling, backend version, and quant choice
- The user's config keeps reasoning on with unlimited budget, making repeated internal reasoning especially expensive when the model starts cycling
- Q4_K_XL may preserve quality better than smaller IQ quants, but the speed drop from 40 tok/s to 27 tok/s can erase that benefit for interactive use
- Recent community chatter around Qwen3.6 points to backend quirks in speculative decoding, tool calls, and recurrent-state handling, so upgrading llama.cpp/TurboQuant builds may matter as much as sampler tweaks
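The interaction between the two complaints in the thread is worth spelling out: a lower raw tok/s rate and a higher tendency to loop compound each other, because looping inflates the hidden reasoning token count that the slower quant must grind through. A back-of-envelope sketch (the token counts here are illustrative assumptions, not figures from the thread; only the 40 and 27 tok/s rates come from the post):

```python
# Hypothetical arithmetic: how a modest tok/s drop compounds with
# reasoning loops. Token counts are assumed for illustration.

def wall_clock_seconds(answer_tokens: int, reasoning_tokens: int, tok_per_s: float) -> float:
    """Time until the user sees the full answer, counting hidden reasoning tokens."""
    return (answer_tokens + reasoning_tokens) / tok_per_s

# IQ4_XS at ~40 tok/s with well-behaved reasoning (~500 hidden tokens)
fast = wall_clock_seconds(300, 500, 40.0)    # 20.0 s
# Q4_K_XL at ~27 tok/s that loops (~3000 hidden tokens before answering)
slow = wall_clock_seconds(300, 3000, 27.0)   # ~122.2 s

print(f"{fast:.1f}s vs {slow:.1f}s -> {slow / fast:.1f}x slower end to end")
```

Under these assumed numbers the interactive gap is roughly 6x, not the ~1.5x the raw tok/s figures suggest, which is why a reasoning budget cap (or fixing the looping itself) can matter more than the quant choice.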
// TAGS
qwen3.6-35b-a3b · llm · reasoning · inference · gpu · self-hosted · open-weights
DISCOVERED
4h ago
2026-04-23
PUBLISHED
5h ago
2026-04-23
RELEVANCE
6/10
AUTHOR
EggDroppedSoup