
Qwen3.6-35B-A3B crowns 12GB sweet spot

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more.

// WHAT AICRIER DOES

7+ tracked feeds, scraped 24/7: short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

// 4h ago · BENCHMARK RESULT


A Reddit benchmark on an RTX 3060 12GB shows that Qwen3.6-35B-A3B is surprisingly practical to run locally, especially with a tuned `-ncmoe` value and a q8 KV cache. The author reports ~46-47 tok/s decode and says 32k context is usable without falling off the VRAM cliff.
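The "VRAM cliff" at long context is largely KV-cache growth, and q8_0 roughly halves it. A rough back-of-envelope sketch (the layer/head counts below are illustrative placeholders, not Qwen3.6-35B-A3B's published config):

```python
# Illustrative KV-cache size estimate. The model dimensions here
# (48 layers, 4 KV heads, head_dim 128) are hypothetical placeholders.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

CTX = 32_768
F16 = 2.0        # 2 bytes per element
Q8_0 = 34 / 32   # llama.cpp q8_0: 32 int8 values + one fp16 scale per block

f16_gib = kv_cache_bytes(48, 4, 128, CTX, F16) / 2**30
q8_gib = kv_cache_bytes(48, 4, 128, CTX, Q8_0) / 2**30
print(f"f16 KV: {f16_gib:.2f} GiB, q8_0 KV: {q8_gib:.2f} GiB")
```

Under these assumed dimensions, 32k of f16 cache costs about 3 GiB while q8_0 costs about 1.6 GiB; on a 12GB card that difference is what separates "usable" from "cliff".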

// ANALYSIS

The interesting part is not the raw speed; it's that a 35B MoE model crosses from "theoretical" to "daily-driver" territory on 12GB if you tune offload carefully.

  • The sweet spot appears to be `-ncmoe 18-20`; dropping to `16` (putting more experts in VRAM) triggers a sharp performance cliff
  • q8 KV cache is effectively free here, so the usual memory-vs-quality tradeoff tilts firmly toward quantizing the cache and spending the savings on context
  • Plain decoding already lands around 46 tok/s, which makes MTP (multi-token prediction) only a marginal upgrade in this setup
  • The practical win is context: 16k to 32k feels achievable instead of being a benchmark-only configuration
  • For local coding, this is a better signal than headline benchmark charts because it reflects the real constraint: VRAM, not just FLOPS
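Put together, the settings above map onto a llama.cpp server launch roughly like this. This is a sketch, not the author's exact command: the model filename and quant are hypothetical, `-ncmoe` is the short form of `--n-cpu-moe`, and flag spellings vary between llama.cpp builds (check `llama-server --help`):

```shell
# Keep 18 MoE expert layers on CPU (reported sweet spot: 18-20),
# offload everything else to the 12GB GPU, and quantize both
# K and V caches to q8_0 (V-cache quantization needs flash attention).
llama-server \
  -m ./Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 18 \
  -fa \
  -ctk q8_0 -ctv q8_0
```

The `--n-cpu-moe` value is the knob to sweep per card: each step down moves more expert tensors into VRAM, which is fast until it isn't.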
// TAGS
llm · moe · quantization · inference · benchmark · qwen3-6-35b-a3b

DISCOVERED: 4h ago (2026-05-09)

PUBLISHED: 7h ago (2026-05-08)

RELEVANCE: 9/10

AUTHOR: jwestra