OPEN_SOURCE ↗
REDDIT // 4h ago // INFRASTRUCTURE
Qwen3.6 users report reasoning loops
A LocalLLaMA user says Unsloth's Q4_K_XL GGUF quant of Qwen3.6-35B-A3B is slower than IQ4_XS on their 8GB VRAM setup and appears more prone to looping during reasoning. The thread is more troubleshooting signal than news, but it highlights the practical tradeoffs local users face when chasing lower KLD quants.
// ANALYSIS
This is the messy underside of open-weight inference: better quant metrics do not automatically mean better wall-clock behavior, especially with reasoning mode, MoE routing, huge context, CPU offload, and fork-specific llama.cpp behavior in the mix.
- Qwen3.6-35B-A3B is a serious open MoE model, but local serving stability still depends heavily on sampler settings, template handling, backend version, and quant choice
- The user's config keeps reasoning on with unlimited budget, making repeated internal reasoning especially expensive when the model starts cycling
- Q4_K_XL may preserve quality better than smaller IQ quants, but the speed drop from 40 tok/s to 27 tok/s can erase that benefit for interactive use
- Recent community chatter around Qwen3.6 points to backend quirks in speculative decoding, tool calls, and recurrent-state handling, so upgrading llama.cpp/TurboQuant builds may matter as much as sampler tweaks
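The interaction between the two complaints in the thread is worth spelling out: a lower raw tok/s rate and a higher tendency to loop compound each other, because looping inflates the hidden reasoning token count that the slower quant must grind through. A back-of-envelope sketch (the token counts here are illustrative assumptions, not figures from the thread; only the 40 and 27 tok/s rates come from the post):

```python
# Hypothetical arithmetic: how a modest tok/s drop compounds with
# reasoning loops. Token counts are assumed for illustration.

def wall_clock_seconds(answer_tokens: int, reasoning_tokens: int, tok_per_s: float) -> float:
    """Time until the user sees the full answer, counting hidden reasoning tokens."""
    return (answer_tokens + reasoning_tokens) / tok_per_s

# IQ4_XS at ~40 tok/s with well-behaved reasoning (~500 hidden tokens)
fast = wall_clock_seconds(300, 500, 40.0)    # 20.0 s
# Q4_K_XL at ~27 tok/s that loops (~3000 hidden tokens before answering)
slow = wall_clock_seconds(300, 3000, 27.0)   # ~122.2 s

print(f"{fast:.1f}s vs {slow:.1f}s -> {slow / fast:.1f}x slower end to end")
```

Under these assumed numbers the interactive gap is roughly 6x, not the ~1.5x the raw tok/s figures suggest, which is why a reasoning budget cap (or fixing the looping itself) can matter more than the quant choice.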
// TAGS
qwen3.6-35b-a3b · llm · reasoning · inference · gpu · self-hosted · open-weights
DISCOVERED
4h ago
2026-04-23
PUBLISHED
5h ago
2026-04-23
RELEVANCE
6/10
AUTHOR
EggDroppedSoup