OPEN_SOURCE ↗
REDDIT // 4h ago · TUTORIAL
Qwen 3.6 hits 100 t/s on dual RTX 3090s via vLLM
A community-tested deployment guide for serving the open-weight Qwen 3.6-35B-A3B model using vLLM and Docker. By leveraging tensor parallelism and the new multi-token prediction speculative decoding, the setup achieves high-throughput local inference with a 64k context window on consumer-grade hardware.
// ANALYSIS
This configuration shows that Qwen 3.6's MoE architecture is well suited to high-speed local serving on dual 24 GB GPUs.
- Tensor parallelism is essential for splitting the 35B model across dual 3090s, preventing OOM errors while maximizing memory bandwidth.
- Multi-token prediction (`qwen3_next_mtp`) via speculative decoding provides a substantial throughput boost, reaching over 100 t/s for standard generation.
- The combination of AWQ 4-bit quantization and prefix caching enables a 63,000-token context window, though prompt processing time scales significantly at the limit.
- Using `ipc: host` is a critical optimization for Docker-based vLLM setups to avoid shared-memory bottlenecks during GPU-to-GPU communication.
- Benchmark results show strong scaling, maintaining double-digit throughput even at maximum context lengths.
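The points above can be sketched as a single `docker run` invocation. This is a hedged reconstruction, not the post's exact command: the model repo path is illustrative, and the flag names and `--speculative-config` schema are assumptions based on recent vLLM releases, so verify them against your installed version.

```shell
# Dual-RTX-3090 vLLM serving sketch for the setup described above.
# NOTE: the Hugging Face repo name below is illustrative, not confirmed.
# --ipc=host is the `docker run` equivalent of `ipc: host` in compose,
# needed so the two tensor-parallel workers can share memory.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.6-35B-A3B-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 63000 \
  --enable-prefix-caching \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```

With the server up, any OpenAI-compatible client pointed at `http://localhost:8000/v1` can issue completions; speculative decoding is transparent to the client.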
// TAGS
qwen-3-6 · vllm · gpu · self-hosted · inference · docker · nvidia · open-source
DISCOVERED
4h ago
2026-04-18
PUBLISHED
4h ago
2026-04-18
RELEVANCE
8/10
AUTHOR
Zyj