OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
Qwen3.6-35B-A3B hits 21.7 tok/s on consumer GPUs
Local LLM benchmarks show that Qwen3.6-35B-A3B, a sparse Mixture-of-Experts model, achieves 21.7 tokens/second on dual RTX 5060 Ti GPUs using hybrid offloading. The model bridges the gap between high parameter counts and consumer hardware, and excels at agentic coding with a 73.4% SWE-bench Verified score.
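For context, a minimal sketch of the kind of launch configuration the post describes: llama-server with all layers on GPU but MoE expert weights kept in system RAM via --cpu-moe. The GGUF file name and context size are illustrative assumptions, not details from the post.

```python
# Hedged sketch: hybrid MoE offloading with llama.cpp on two GPUs.
# Assumptions: llama-server is on PATH, the quant file name is hypothetical,
# and the build supports --cpu-moe (keeps expert tensors in system RAM).
import subprocess

cmd = [
    "llama-server",
    "-m", "qwen3.6-35b-a3b-q4_k_m.gguf",  # hypothetical GGUF quant
    "-ngl", "99",              # offload all layers to GPU...
    "--cpu-moe",               # ...but keep MoE expert weights in system RAM
    "--tensor-split", "1,1",   # split GPU-resident weights across both cards
    "-c", "16384",             # illustrative context length; tune to VRAM
]
subprocess.run(cmd, check=True)
```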
// ANALYSIS
Sparse MoE architectures are making high-end reasoning viable on consumer-grade setups, though prompt processing remains a significant bottleneck compared to dense models.
- Hybrid offloading (--cpu-moe) provides "free" performance gains by offloading inactive experts to system RAM without sacrificing generation speed.
- The model shows a major reasoning leap, outperforming the Qwen 3.5 dense variant by a substantial margin in agentic benchmarks like Terminal-Bench 2.0.
- PCIe bandwidth limits ingestion efficiency, leaving dense models with a nearly 2x advantage in prompt processing speeds.
- Technical stability remains a challenge; current custom llama.cpp builds crash when combining Gated Delta Net optimizations with hybrid offloading.
- Real-world tests confirm autonomous reliability, with the model successfully completing multi-step tool calls for infrastructure automation (a sketch of one such round trip follows this list).
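A rough sketch of what one tool-call round trip looks like against llama-server's OpenAI-compatible endpoint (tool calling requires launching the server with --jinja). The port, tool name, schema, and prompt are illustrative assumptions; the post does not specify its exact setup.

```python
# Hedged sketch of a single tool-call round trip against a local llama-server
# (OpenAI-compatible /v1/chat/completions). Tool name and schema are
# hypothetical stand-ins for the post's infrastructure-automation tools.
import json
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "restart_service",  # hypothetical tool, not from the post
        "description": "Restart a systemd service on a managed host",
        "parameters": {
            "type": "object",
            "properties": {
                "host": {"type": "string"},
                "service": {"type": "string"},
            },
            "required": ["host", "service"],
        },
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # llama-server default port
    json={
        "model": "qwen3.6-35b-a3b",
        "messages": [{"role": "user", "content": "nginx on web-01 is down; fix it."}],
        "tools": tools,
    },
    timeout=120,
)
# If the model decides to call a tool, the reply carries a tool_calls entry;
# an agent loop would execute it and feed the result back as a tool message.
call = resp.json()["choices"][0]["message"].get("tool_calls", [None])[0]
if call:
    print(call["function"]["name"], json.loads(call["function"]["arguments"]))
```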
// TAGS
qwen3.6-35b-a3b · llm · benchmark · gpu · inference · open-weights · reasoning · ai-coding
DISCOVERED
3h ago
2026-04-17
PUBLISHED
6h ago
2026-04-17
RELEVANCE
9/10
AUTHOR
Defilan