
Qwen3.6-35B-A3B hits 21.7 tok/s on consumer GPUs


// 45d ago · BENCHMARK RESULT


Local LLM benchmarks show that Qwen3.6-35B-A3B, a sparse Mixture-of-Experts (MoE) model, generates 21.7 tokens/second on dual RTX 5060 Ti GPUs using hybrid offloading. The sparse design makes a high-parameter-count model practical on consumer hardware, and the model scores 73.4% on SWE-bench Verified, an agentic coding benchmark.
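
To put the reported generation rate in practical terms, here is a minimal sketch that converts tokens/second into wall-clock time. Only the 21.7 tok/s figure comes from the benchmark; the function names, the output lengths, and the `prompt_tok_per_s` parameter are hypothetical and chosen for illustration, since the summary gives no prompt-processing rate.

```python
# Generation rate reported in the benchmark summary (tokens per second).
GEN_TOK_PER_S = 21.7


def generation_seconds(n_output_tokens: int, tok_per_s: float = GEN_TOK_PER_S) -> float:
    """Seconds spent decoding n_output_tokens at a steady generation rate."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return n_output_tokens / tok_per_s


def end_to_end_seconds(
    n_prompt_tokens: int,
    n_output_tokens: int,
    prompt_tok_per_s: float,  # ASSUMPTION: supply your own measured value
    gen_tok_per_s: float = GEN_TOK_PER_S,
) -> float:
    """Prompt-processing time plus generation time for one request."""
    if prompt_tok_per_s <= 0:
        raise ValueError("prompt_tok_per_s must be positive")
    prompt_time = n_prompt_tokens / prompt_tok_per_s
    return prompt_time + generation_seconds(n_output_tokens, gen_tok_per_s)


if __name__ == "__main__":
    # Illustrative output lengths, not taken from the benchmark.
    for n in (100, 500, 2000):
        print(f"{n:>5} output tokens: {generation_seconds(n):6.1f} s")
```

The model treats both phases as constant-rate, which ignores warm-up and context-length effects, so it is best read as a rough lower bound on response time.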

// ANALYSIS

Sparse MoE architectures are making high-end reasoning viable on consumer-grade setups, though prompt processing remains a significant bottleneck compared to dense models.

  • Hybrid offloading (--cpu-moe) is described as a "free" gain: MoE expert weights are kept in system RAM, freeing VRAM without sacrificing generation speed.
  • The model shows a major reasoning leap, outperforming the Qwen 3.5 dense variant by a substantial margin in agentic benchmarks like Terminal-Bench 2.0.
  • PCIe bandwidth limits ingestion efficiency, leaving dense models with a nearly 2x advantage in prompt processing speeds.
  • Technical stability remains a challenge; current custom llama.cpp builds crash when combining Gated Delta Net optimizations with hybrid offloading.
  • Real-world tests confirm autonomous reliability, with the model successfully completing multi-step tool calls for infrastructure automation.
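
The hybrid-offloading point above can be illustrated with a rough weight-memory estimate. This is a sketch under stated assumptions, not a measurement. The 35B total is read from the model name, while the expert share (`EXPERT_PARAMS_B`) and the quantization width (`BITS_PER_WEIGHT`) are hypothetical placeholders. The estimate also ignores the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS_B = 35.0    # total parameters in billions, from the model name
EXPERT_PARAMS_B = 32.0   # ASSUMPTION: share held in MoE expert tensors (hypothetical)
BITS_PER_WEIGHT = 4.5    # ASSUMPTION: average width of a 4-bit-class quantization


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB for a parameter count and bit width."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def split_footprint(total_b: float, expert_b: float, bits_per_weight: float):
    """Return (gpu_gib, cpu_gib) when all expert weights sit in system RAM
    and every other tensor stays on the GPU."""
    if not 0 <= expert_b <= total_b:
        raise ValueError("expert_b must be between 0 and total_b")
    gpu_gib = weight_gib(total_b - expert_b, bits_per_weight)
    cpu_gib = weight_gib(expert_b, bits_per_weight)
    return gpu_gib, cpu_gib


if __name__ == "__main__":
    gpu, cpu = split_footprint(TOTAL_PARAMS_B, EXPERT_PARAMS_B, BITS_PER_WEIGHT)
    print(f"GPU-resident weights: {gpu:.1f} GiB")
    print(f"CPU-resident expert weights: {cpu:.1f} GiB")
```

Because only a small subset of experts is active per token, the tensors held in system RAM are read sparsely during generation, which is one plausible reason the reported generation speed holds up. Prompt processing moves far more data per step, which fits the PCIe bottleneck noted in the analysis.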
// TAGS
qwen3.6-35b-a3b, llm, benchmark, gpu, inference, open-weights, reasoning, ai-coding

DISCOVERED: 2026-04-17 (45d ago)

PUBLISHED: 2026-04-17 (45d ago)

RELEVANCE: 9/10

AUTHOR: Defilan