Qwen3.6 35B A3B Coheres Better on Q8

// 53d ago · MODEL RELEASE

A LocalLLaMA user reports that Qwen3.6-35B-A3B lost coherence in a low-bit IQ4_XS quant but held up reliably after a switch to an Unsloth UD Q8 build, even though throughput dropped to about 40 tok/s on a 24GB card. On Q8, the model stayed coherent through dozens of agent tool calls, including a self-written web-search extension.

// ANALYSIS

This reads less like a benchmark and more like a reminder that agentic coding punishes lossy quantization hard. For long, tool-heavy sessions, output quality and memory configuration (context handling, MoE offload) can matter more than raw speed.

  • Qwen’s own model card emphasizes agentic coding and “thinking preservation,” so the report fits the release’s intended use case
  • The contrast between IQ4_XS and Q8 suggests ultra-low-bit quants may be fine for chat, but still too brittle for sustained agent loops
  • On 24GB VRAM, the real tradeoff is reliability versus latency: Q8 plus CPU MoE offload is slower, but apparently far steadier
  • The llama.cpp serving flags matter here too; context handling, MTP, and MoE offload choices can change whether the model stays on track
  • If this holds up across more users, Qwen3.6 looks more compelling as a local agent model than as a pure throughput play
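
The 24GB tradeoff in the bullets above can be made concrete with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement from the post. It assumes a flat 35B parameter count taken from the model name, about 4.25 bits per weight for IQ4_XS, and 8.5 bits per weight for plain Q8_0 (32 int8 weights plus one fp16 scale per block). The Unsloth UD Q8 build mixes precisions per layer, so its real size will differ, and KV cache and runtime overhead are ignored. The function names are hypothetical.

```python
# Approximate bits per weight for two llama.cpp quant formats (assumed values).
# Q8_0: 32 int8 weights + one fp16 scale per block = 34 bytes / 32 weights = 8.5 bpw.
# IQ4_XS: about 4.25 bpw.
BITS_PER_WEIGHT = {"IQ4_XS": 4.25, "Q8_0": 8.5}


def weight_gib(params_billion: float, quant: str) -> float:
    """Rough size of the quantized weights in GiB (no KV cache, no overhead)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


def spill_gib(params_billion: float, quant: str, vram_gib: float) -> float:
    """Weights that do not fit in VRAM and would have to sit in system RAM."""
    return max(0.0, weight_gib(params_billion, quant) - vram_gib)


if __name__ == "__main__":
    # 35 and 24 are assumptions read off the model name and the "24GB card".
    for quant in BITS_PER_WEIGHT:
        size = weight_gib(35, quant)
        spill = spill_gib(35, quant, vram_gib=24)
        print(f"{quant}: ~{size:.1f} GiB weights, ~{spill:.1f} GiB beyond 24 GiB")
```

Under these assumptions the IQ4_XS weights fit on the card with room for context, while the Q8 weights exceed it by roughly ten GiB. That is consistent with the report's use of CPU MoE offload and with the lower throughput that comes with it.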
// TAGS
qwen3.6-35b-a3b, llm, agent, ai-coding, inference, open-source

DISCOVERED

53d ago

2026-04-21

PUBLISHED

53d ago

2026-04-21

RELEVANCE

9/10

AUTHOR

s1mplyme