Qwen3.6-35B-A3B hits 21.7 tok/s on consumer GPUs
OPEN_SOURCE
REDDIT // 3h ago // BENCHMARK RESULT

Local LLM benchmarks show that Qwen3.6-35B-A3B, a sparse Mixture-of-Experts model, generates 21.7 tokens/second on dual RTX 5060 Ti GPUs using hybrid offloading. The result narrows the gap between high parameter counts and consumer hardware, and the model excels at agentic coding, scoring 73.4% on SWE-bench Verified.
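A setup along these lines can be sketched as a llama.cpp launch command. The GGUF filename and quantization are assumptions for illustration; `--cpu-moe`, `-ngl`, and `--split-mode` are real llama.cpp options, but the exact flags the benchmarker used are not stated in the post.

```shell
# --cpu-moe keeps the MoE expert tensors in system RAM ("hybrid offloading"),
# -ngl 99 offloads all offloadable layers (attention + dense) to the GPUs,
# --split-mode layer spreads those GPU layers across both cards.
llama-server \
  -m qwen3.6-35b-a3b-q4_k_m.gguf \
  --cpu-moe \
  -ngl 99 \
  --split-mode layer
```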

// ANALYSIS

Sparse MoE architectures are making high-end reasoning viable on consumer-grade setups, though prompt processing remains a significant bottleneck compared to dense models.

  • Hybrid offloading (--cpu-moe) provides "free" performance gains by offloading inactive experts to system RAM without sacrificing generation speed.
  • The model shows a major reasoning leap, outperforming the Qwen 3.5 dense variant by a substantial margin in agentic benchmarks like Terminal-Bench 2.0.
  • PCIe bandwidth limits ingestion efficiency, leaving dense models with a nearly 2x advantage in prompt processing speeds.
  • Technical stability remains a challenge; current custom llama.cpp builds crash when combining Gated Delta Net optimizations with hybrid offloading.
  • Real-world tests confirm autonomous reliability, with the model successfully completing multi-step tool calls for infrastructure automation.
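The PCIe point above can be made concrete with a back-of-envelope calculation. All figures below (active parameter count, bits per weight, bus bandwidth) are round-number assumptions, not measurements from the post; the sketch only shows why streaming expert weights over the bus each token would cap throughput in the same ballpark as the observed generation speed.

```python
# Back-of-envelope sketch: how much data an A3B MoE model touches per token,
# and what that implies if it all had to cross a PCIe 4.0 x16 link.
active_params = 3e9        # ~3B parameters activated per token (the "A3B")
bits_per_weight = 4.5      # assumed effective quantization (e.g. Q4_K-class)
bytes_per_token = active_params * bits_per_weight / 8  # ~1.69 GB/token

pcie_gen4_x16 = 32e9       # ~32 GB/s usable, one direction (assumption)

# If every active expert weight crossed PCIe each token, the bus alone
# would bound generation near:
tok_s_pcie_bound = pcie_gen4_x16 / bytes_per_token
print(round(tok_s_pcie_bound, 1))  # → 19.0
```

This is why `--cpu-moe` computes the expert layers on the CPU from system RAM instead of shuttling them to the GPU, and why prompt processing (which must batch many tokens through those same experts) stays bandwidth-bound relative to dense models.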
// TAGS
qwen3.6-35b-a3b · llm · benchmark · gpu · inference · open-weights · reasoning · ai-coding

DISCOVERED

3h ago

2026-04-17

PUBLISHED

6h ago

2026-04-17

RELEVANCE

9/10

AUTHOR

Defilan