OPEN_SOURCE
REDDIT // 4h ago · MODEL RELEASE
Qwen3.6 35B A3B Coheres Better on Q8
A LocalLLaMA user says Qwen3.6-35B-A3B fell apart in a low-bit IQ4_XS quant, but became rock solid after moving to an Unsloth UD Q8 build, even with throughput cut to about 40 tok/s on a 24GB card. The model then stayed coherent through dozens of agent tool calls, including a self-written web-search extension.
// ANALYSIS
This reads less like a benchmark and more like a reminder that agentic coding punishes lossy quantization hard. For long tool-heavy sessions, quality and memory plumbing can matter more than raw speed.
- Qwen’s own model card emphasizes agentic coding and “thinking preservation,” so the report fits the release’s intended use case
- The contrast between IQ4_XS and Q8 suggests ultra-low-bit quants may be fine for chat, but still too brittle for sustained agent loops
- On 24GB VRAM, the real tradeoff is reliability versus latency: Q8 plus CPU MoE offload is slower, but apparently far steadier
- The llama.cpp serving flags matter here too; context handling, MTP, and MoE offload choices can change whether the model stays on track
- If this holds up across more users, Qwen3.6 looks more compelling as a local agent model than as a pure throughput play
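The Q8-plus-CPU-MoE-offload setup described above can be sketched as a llama.cpp server invocation. This is a hedged illustration, not the poster's actual command: the GGUF filename, context size, and tensor-override regex are assumptions, though `-ngl` and `-ot`/`--override-tensor` are real llama.cpp flags commonly used for exactly this kind of MoE offload.

```shell
# Sketch: serve a Q8 GGUF build on a 24GB card, keeping MoE expert
# tensors in system RAM so the dense layers fit in VRAM.
# Filename and values are illustrative assumptions.
llama-server \
  -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  -c 32768 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```

Here `-ngl 99` asks for all layers on the GPU, while the `-ot` regex overrides that for the feed-forward expert tensors, pushing them to CPU; that is the reliability-for-latency trade the bullets describe, since expert weights streaming from RAM is what drags throughput down toward the reported ~40 tok/s.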
// TAGS
qwen3.6-35b-a3b · llm · agent · ai-coding · inference · open-source
DISCOVERED
4h ago
2026-04-21
PUBLISHED
8h ago
2026-04-21
RELEVANCE
9/10
AUTHOR
s1mplyme