Qwen 3.6 hits 100 t/s on dual RTX 3090s via vLLM
A community-tested deployment guide for serving the open-weight Qwen 3.6-35B-A3B model with vLLM and Docker. Using tensor parallelism and multi-token prediction (MTP) speculative decoding, the setup delivers high-throughput local inference with a context window of roughly 64k tokens on consumer-grade hardware.
This configuration suggests that Qwen 3.6's MoE architecture is well suited to high-speed local serving on dual 24GB GPUs: the A3B suffix indicates that only about 3B of the 35B parameters are active per token.
- Tensor parallelism splits the 35B model across both 3090s, which avoids OOM errors and puts the memory bandwidth of both cards to work.
- Multi-token prediction (`qwen3_next_mtp`), used as the speculative decoding method, substantially raises throughput, reaching over 100 t/s for standard generation.
- AWQ 4-bit quantization combined with prefix caching enables a 63,000-token context window, though prompt processing time grows significantly near that limit.
- Setting `ipc: host` is a critical optimization for Docker-based vLLM setups: it avoids shared-memory bottlenecks during GPU-to-GPU communication.
- Benchmark results show throughput staying in the double digits (t/s) even at the maximum context length.
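
The points above can be sketched as a small Python helper that assembles a `vllm serve` command line and estimates the quantized weight footprint per GPU. This is a minimal sketch, not the guide's actual configuration: the model ID is a placeholder (the summary does not name the AWQ checkpoint), the number of speculative tokens is an assumed value, and the helper names `build_vllm_args` and `approx_weight_gib_per_gpu` are hypothetical. The weight estimate ignores quantization scales, any unquantized layers, and the KV cache.

```python
import json
import shlex

# Placeholder: the summary does not name the exact AWQ checkpoint.
MODEL_ID = "<awq-4bit-qwen3.6-35b-a3b-checkpoint>"


def build_vllm_args(model_id: str, num_gpus: int = 2,
                    max_len: int = 63_000, draft_tokens: int = 2) -> list[str]:
    """Assemble a `vllm serve` command line for the setup described above."""
    speculative = {
        "method": "qwen3_next_mtp",              # method name taken from the text
        "num_speculative_tokens": draft_tokens,  # assumed value, tune per workload
    }
    return [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(num_gpus),  # shard the model across GPUs
        "--max-model-len", str(max_len),          # context window in tokens
        "--enable-prefix-caching",                # reuse KV cache for shared prefixes
        "--speculative-config", json.dumps(speculative),
        # AWQ is normally detected from the checkpoint's own config,
        # so no explicit quantization flag is added here.
    ]


def approx_weight_gib_per_gpu(params_billion: float, bits: int, num_gpus: int) -> float:
    """Rough weight memory per GPU in GiB, assuming an even tensor-parallel split."""
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / num_gpus / 2**30


if __name__ == "__main__":
    print(shlex.join(build_vllm_args(MODEL_ID)))
    print(f"~{approx_weight_gib_per_gpu(35, 4, 2):.1f} GiB of weights per GPU")
```

In a Docker-based setup the same flags would be passed to the container, together with `--ipc=host` on `docker run` or `ipc: host` in a Compose file. Under these assumptions the 4-bit weights take roughly a third of each 24GB card, which is what leaves room for a long-context KV cache.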
DISCOVERED: 2026-04-18
PUBLISHED: 2026-04-18
AUTHOR: Zyj