Qwen 3.6 hits 100 t/s on dual RTX 3090s via vLLM

// 90d ago | TUTORIAL

A community-tested deployment guide for serving the open-weight Qwen 3.6-35B-A3B model with vLLM and Docker. By combining tensor parallelism across two GPUs with multi-token prediction (MTP) speculative decoding, the setup reaches roughly 100 tokens per second with a context window of about 64k tokens on consumer-grade hardware.
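As a rough back-of-the-envelope check of why a 35B-parameter model can fit on two 24 GB cards, the sketch below estimates the weight footprint under 4-bit quantization and the VRAM left on each GPU after an even split. The function names are illustrative, and the estimate deliberately ignores quantization scales, any layers kept at higher precision, activations, the CUDA context, and the KV cache, so real headroom will be smaller.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def per_gpu_headroom_gb(
    num_params: float,
    bits_per_weight: float,
    num_gpus: int,
    vram_gb_per_gpu: float,
) -> float:
    """VRAM left per GPU, assuming the weights split evenly across GPUs.

    Assumption: an even split is an idealization of tensor parallelism;
    some tensors may be replicated on every GPU in practice.
    """
    per_gpu_weights = weight_footprint_gb(num_params, bits_per_weight) / num_gpus
    return vram_gb_per_gpu - per_gpu_weights


if __name__ == "__main__":
    total = weight_footprint_gb(35e9, 4)             # 35B parameters at 4 bits
    headroom = per_gpu_headroom_gb(35e9, 4, 2, 24.0)  # two 24 GB RTX 3090s
    print(f"weights: ~{total:.1f} GB total, ~{headroom:.2f} GB headroom per GPU")
```

Under these assumptions the weights take about 17.5 GB in total, leaving on the order of 15 GB per card for the KV cache and runtime overhead, which is what makes a long context window plausible on this hardware.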

// ANALYSIS

This configuration suggests that Qwen 3.6's MoE architecture, in which only about 3B of the 35B parameters are active per token (the "A3B" in the model name), is well suited to fast local serving on two 24GB GPUs.

–Tensor parallelism splits the 35B model across the two 3090s, which avoids out-of-memory errors and lets both cards' memory bandwidth contribute to each decoding step.
–Multi-token prediction (`qwen3_next_mtp`), used as the speculative decoding method, raises throughput to over 100 t/s for standard generation.
–AWQ 4-bit quantization leaves enough VRAM for a 63,000-token context window, and prefix caching avoids recomputing shared prompt prefixes, though prompt processing time grows substantially near the limit.
–Setting `ipc: host` is important for Docker-based vLLM setups: it exposes the host's shared memory to the container, avoiding shared-memory limits when the tensor-parallel worker processes exchange data.
–Benchmark results show throughput staying in double digits (tokens per second) even at the maximum context length.
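The options listed above can be sketched as a single `vllm serve` invocation. This is a minimal sketch, not the guide's actual command: the model ID is a hypothetical placeholder, `num_speculative_tokens=2` is an assumed value that the source does not state, and the flag names follow recent vLLM releases and should be checked against the installed version.

```python
import json
import shlex


def build_vllm_serve_args(
    model: str,
    tensor_parallel_size: int = 2,    # one shard per RTX 3090
    max_model_len: int = 63000,       # context length cited in the analysis
    num_speculative_tokens: int = 2,  # assumption: not given in the source
) -> list[str]:
    """Build an argument list for `vllm serve` mirroring the described setup."""
    speculative_config = {
        "method": "qwen3_next_mtp",  # MTP method named in the analysis
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--quantization", "awq",
        "--enable-prefix-caching",
        "--speculative-config", json.dumps(speculative_config),
    ]


if __name__ == "__main__":
    # Hypothetical model ID; substitute the AWQ checkpoint actually in use.
    args = build_vllm_serve_args("your-org/Qwen3.6-35B-A3B-AWQ")
    print(shlex.join(args))
```

In a Docker Compose deployment, everything after `vllm serve` would typically become the service's `command`, with `ipc: host` set on the same service.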

// TAGS

qwen-3-6, vllm, gpu, self-hosted, inference, docker, nvidia, open-source

DISCOVERED

90d ago

2026-04-18

PUBLISHED

90d ago

2026-04-18

RELEVANCE

8/10

AUTHOR

Zyj
