Qwen 3.6 hits 100 t/s on dual RTX 3090s via vLLM
OPEN_SOURCE ↗
REDDIT // 4h ago · TUTORIAL

A community-tested deployment guide for serving the open-weight Qwen 3.6-35B-A3B model using vLLM and Docker. By leveraging tensor parallelism and the new multi-token prediction speculative decoding, the setup achieves high-throughput local inference with a 64k context window on consumer-grade hardware.

// ANALYSIS

This configuration demonstrates that Qwen 3.6's MoE architecture is well suited to high-speed local serving on dual 24GB GPUs.

  • Tensor parallelism is essential for splitting the 35B model across dual 3090s, preventing OOM errors while maximizing memory bandwidth.
  • Multi-token prediction (`qwen3_next_mtp`) via speculative decoding provides a massive throughput boost, reaching over 100 t/s for standard generation.
  • The combination of AWQ 4-bit quantization and prefix caching enables a 63,000-token context window, though prompt processing time scales significantly at the limit.
  • Using `ipc: host` is a critical optimization for Docker-based vLLM setups to avoid shared memory bottlenecks during GPU-to-GPU communication.
  • Benchmark results show impressive scaling, maintaining double-digit throughput even at maximum context lengths.
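Taken together, the points above map onto a single vLLM launch command. The following `docker run` invocation is an illustrative sketch only: the image tag, the AWQ checkpoint name, and the exact speculative-decoding settings are assumptions, not the poster's verified configuration.

```shell
# Hypothetical dual-3090 launch; model/checkpoint names are assumptions.
# --ipc=host          lets NCCL use host shared memory for GPU-to-GPU traffic
# --tensor-parallel-size 2   splits the 35B MoE across both 24GB cards
# --enable-prefix-caching    reuses KV cache across shared prompt prefixes
# --speculative-config       enables the multi-token-prediction draft method named in the post
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3.6-35B-A3B-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 64000 \
  --enable-prefix-caching \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```

Once the server is up, it exposes the standard OpenAI-compatible API on port 8000, so any OpenAI client pointed at `http://localhost:8000/v1` can be used for benchmarking.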
// TAGS
qwen-3-6 · vllm · gpu · self-hosted · inference · docker · nvidia · open-source

DISCOVERED

4h ago

2026-04-18

PUBLISHED

4h ago

2026-04-18

RELEVANCE

8/10

AUTHOR

Zyj