Qwen3.6 35B A3B Coheres Better on Q8
A LocalLLaMA user says Qwen3.6-35B-A3B fell apart in a low-bit IQ4_XS quant, but became rock solid after moving to an Unsloth UD Q8 build, even with throughput cut to about 40 tok/s on a 24GB card. The model then stayed coherent through dozens of agent tool calls, including a self-written web-search extension.
This reads less like a benchmark and more like a reminder that agentic coding punishes lossy quantization hard. For long, tool-heavy sessions, output quality and how the model is split across GPU and CPU memory can matter more than raw speed.
- Qwen’s own model card emphasizes agentic coding and “thinking preservation,” so the report fits the release’s intended use case
- The contrast between IQ4_XS and Q8 suggests ultra-low-bit quants may be fine for chat, but still too brittle for sustained agent loops
- On 24GB VRAM, the real tradeoff is reliability versus latency: Q8 plus CPU MoE offload is slower, but apparently far steadier
- The llama.cpp serving flags matter here too; context handling, MTP, and MoE offload choices can change whether the model stays on track
- If this holds up across more users, Qwen3.6 looks more compelling as a local agent model than as a pure throughput play
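The reliability-versus-latency tradeoff on a 24GB card can be sketched with back-of-the-envelope arithmetic. The snippet below is a rough estimate, not a measurement: it assumes 35 billion total parameters (taken from the model name), uses the nominal llama.cpp averages of about 4.25 bits per weight for IQ4_XS and 8.5 for Q8_0, and ignores KV cache and compute buffers. The Unsloth UD Q8 build mixes tensor types, so its real file size will differ. The function names are illustrative only.

```python
# Nominal bits per weight for two llama.cpp quant types (assumed averages;
# real GGUF files mix tensor types, and the Unsloth UD build will differ).
BITS_PER_WEIGHT = {"IQ4_XS": 4.25, "Q8_0": 8.5}

TOTAL_PARAMS = 35e9  # assumption: taken from the "35B" in the model name
VRAM_GB = 24.0       # card size mentioned in the report


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def fits_in_vram(size_gb: float, vram_gb: float = VRAM_GB) -> bool:
    """True if the weights alone fit; ignores KV cache and compute buffers."""
    return size_gb < vram_gb


if __name__ == "__main__":
    for name, bpw in BITS_PER_WEIGHT.items():
        size = weight_size_gb(TOTAL_PARAMS, bpw)
        print(f"{name}: ~{size:.1f} GB, weights fit in {VRAM_GB:.0f} GB: {fits_in_vram(size)}")
```

Under these assumptions the IQ4_XS weights come to roughly 18.6 GB and the Q8 weights to roughly 37.2 GB, which is consistent with the report: the 4-bit quant can sit entirely on the GPU, while the 8-bit build has to push part of the model (here, the MoE expert tensors) to CPU memory and pays for it in tokens per second.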
DISCOVERED
2026-04-21
PUBLISHED
2026-04-21
AUTHOR
s1mplyme