Qwen3.6 35B A3B Coheres Better on Q8
OPEN_SOURCE
REDDIT // 4h ago // MODEL RELEASE


A LocalLLaMA user reports that Qwen3.6-35B-A3B fell apart under a low-bit IQ4_XS quant but became rock-solid after moving to an Unsloth UD Q8 build, even with throughput cut to about 40 tok/s on a 24GB card. At Q8, the model stayed coherent through dozens of agent tool calls, including a self-written web-search extension.

// ANALYSIS

This reads less like a benchmark and more like a reminder that agentic coding punishes lossy quantization hard. For long, tool-heavy sessions, quantization quality and memory placement (which tensors live on the GPU versus in system RAM) can matter more than raw speed.

  • Qwen’s own model card emphasizes agentic coding and “thinking preservation,” so the report fits the release’s intended use case
  • The contrast between IQ4_XS and Q8 suggests ultra-low-bit quants may be fine for chat, but still too brittle for sustained agent loops
  • On 24GB VRAM, the real tradeoff is reliability versus latency: Q8 plus CPU MoE offload is slower, but apparently far steadier
  • The llama.cpp serving flags matter here too; context handling, multi-token prediction (MTP), and MoE offload choices can change whether the model stays on track
  • If this holds up across more users, Qwen3.6 looks more compelling as a local agent model than as a pure throughput play
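For readers who want to try the reported setup, the bullets above roughly translate into a llama.cpp launch like the sketch below: run everything on the 24GB GPU except the large MoE expert tensors, which stay in system RAM. The GGUF filename and numeric values are illustrative assumptions, not details taken from the post.

```shell
# Sketch: serving a Q8 MoE GGUF with llama.cpp on a 24GB card.
# The model filename and numbers below are hypothetical; tune for your hardware.
#   -m                : path to the GGUF (assumed Unsloth UD Q8 naming)
#   -c                : context window, sized for long agent sessions
#   -ngl 99           : offload all layers to the GPU by default
#   --override-tensor : keep the MoE expert tensors (ffn_*_exps) on the CPU
llama-server \
  -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  -c 32768 \
  -ngl 99 \
  --override-tensor ".ffn_.*_exps.=CPU"
```

This mirrors the reliability-versus-latency tradeoff described above: expert weights streamed from system RAM cap throughput, but the higher-precision quant keeps long agent loops on track.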
// TAGS
qwen3.6-35b-a3b · llm · agent · ai-coding · inference · open-source

DISCOVERED

4h ago

2026-04-21

PUBLISHED

8h ago

2026-04-21

RELEVANCE

9/10

AUTHOR

s1mplyme