Qwen 3.6 hits 100 t/s on dual RTX 3090s via vLLM

// 45d ago · TUTORIAL

A community-tested deployment guide for serving the open-weight Qwen 3.6-35B-A3B model using vLLM and Docker. By leveraging tensor parallelism and the new multi-token prediction speculative decoding, the setup achieves high-throughput local inference with a 64k context window on consumer-grade hardware.

// ANALYSIS

This configuration suggests that Qwen 3.6's MoE architecture, which activates only about 3B of its 35B parameters per token (the "A3B" in the model name), is well suited to high-speed local serving on dual 24GB GPUs.
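As a rough sanity check on why a 35B model fits on two 24GB cards, the sketch below estimates the 4-bit weight footprint and the VRAM left per GPU when the weights are split evenly in two. It is back-of-the-envelope arithmetic, not a measurement from the guide: it ignores quantization scales, any layers kept in higher precision, activations, and the MTP head, and it treats each card as having 24 GiB usable, which is an assumption.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GiB (no quantization metadata)."""
    return num_params * bits_per_weight / 8 / GIB


def per_gpu_headroom_gib(
    num_params: float,
    bits_per_weight: float,
    num_gpus: int,
    vram_gib: float,
) -> float:
    """VRAM left on each GPU after an even split of the weights.

    Whatever remains has to cover the KV cache, activations, and runtime
    overhead, so the real usable headroom is smaller than this figure.
    """
    shard_gib = weight_gib(num_params, bits_per_weight) / num_gpus
    return vram_gib - shard_gib


if __name__ == "__main__":
    total = weight_gib(35e9, 4)  # 35B parameters at 4 bits each
    headroom = per_gpu_headroom_gib(35e9, 4, num_gpus=2, vram_gib=24)
    print(f"4-bit weights: ~{total:.1f} GiB total, ~{total / 2:.1f} GiB per GPU")
    print(f"Headroom per 24 GiB GPU before KV cache: ~{headroom:.1f} GiB")
```

Under these assumptions the weights take roughly a third of each card, which is consistent with the guide leaving most of the VRAM for a long-context KV cache.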

  • Tensor parallelism shards the 35B model across both 3090s so that neither 24GB card has to hold the full weights, which avoids OOM errors and lets the setup draw on the memory bandwidth of both GPUs.
  • Multi-token prediction (`qwen3_next_mtp`) via speculative decoding raises throughput substantially, reaching over 100 t/s for standard generation.
  • AWQ 4-bit quantization shrinks the weights enough to leave VRAM for a 63,000-token context window, and prefix caching avoids recomputing shared prompt prefixes; prompt processing time still grows significantly near the limit.
  • Setting `ipc: host` matters for Docker-based vLLM setups: it gives the container access to the host's shared memory, which the tensor-parallel worker processes use to exchange data, and so avoids shared-memory bottlenecks.
  • Benchmark results show throughput staying in double digits (t/s) even at the maximum context length.
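
The points above can be collected into a single launch command. The sketch below assembles a `docker run` invocation for the `vllm/vllm-openai` image; it is an illustration under stated assumptions, not the guide's actual configuration. The model ID is a hypothetical placeholder, `num_speculative_tokens=2` is a guess, and `--ipc=host` is used here as the `docker run` counterpart of the Compose setting `ipc: host`. Check the flags against the vLLM version you run before relying on them.

```python
import json
import shlex


def build_vllm_docker_command(
    model: str,
    tensor_parallel_size: int = 2,
    max_model_len: int = 63000,
    num_speculative_tokens: int = 2,  # assumption: not stated in the source
    port: int = 8000,
) -> list[str]:
    """Return an argv list that starts vLLM's OpenAI-compatible server in Docker."""
    speculative_config = {
        "method": "qwen3_next_mtp",  # MTP method named in the article
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "docker", "run", "--rm",
        "--gpus", "all",
        "--ipc=host",  # share host shared memory with the container
        "-p", f"{port}:8000",
        "vllm/vllm-openai:latest",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),  # one shard per GPU
        "--quantization", "awq",
        "--max-model-len", str(max_model_len),
        "--enable-prefix-caching",
        "--speculative-config", json.dumps(speculative_config),
    ]


if __name__ == "__main__":
    # "your-org/Qwen3.6-35B-A3B-AWQ" is a hypothetical model ID for illustration.
    cmd = build_vllm_docker_command("your-org/Qwen3.6-35B-A3B-AWQ")
    print(shlex.join(cmd))
```

Building the command as a list keeps the JSON value for `--speculative-config` intact as one argument; `shlex.join` then quotes it safely for pasting into a shell.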
// TAGS
qwen-3-6, vllm, gpu, self-hosted, inference, docker, nvidia, open-source

DISCOVERED

45d ago

2026-04-18

PUBLISHED

45d ago

2026-04-18

RELEVANCE

8/10

AUTHOR

Zyj