Qwen3.5 vLLM Docker config lands on 6000 Pro
OPEN_SOURCE
REDDIT · 20d ago · INFRASTRUCTURE

Ian Hailey's GitHub repo packages a Docker/vLLM stack for serving Sehyo/Qwen3.5-122B-A10B-NVFP4 on a single RTX PRO 6000 Blackwell with a 262k-token context window. It adds Multi-Token Prediction (MTP) speculative decoding and a GPU-specific tuning script, turning the setup into a reproducible inference recipe rather than a one-off config dump.
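The repo holds the authoritative files; as a rough sketch of what a single-GPU launch of this stack might look like (the image tag and model ID come from the post, while the serve flags and the speculative-config values are assumptions, not the repo's actual config):

```shell
# Sketch only: flag values are assumptions; consult the repo for the real setup.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:nightly \
  --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --max-model-len 262144 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```

The nightly tag matters here: per the post, stable vLLM releases don't yet fully support these quantized Qwen builds.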

// ANALYSIS

This is a useful field report, not a splashy launch: it shows how much work it still takes to get a huge NVFP4 MoE to behave on Blackwell, and why nightly software plus hardware-specific tuning are still part of the game. The payoff is that the repo bundles the messy bits into something other Blackwell owners can actually reproduce.

  • The repo proves that a 122B MoE with a 262k context window can be made practical on a single workstation-class Blackwell GPU, which is a meaningful milestone for local serving.
  • `vllm/vllm-openai:nightly` plus a forced Transformers reinstall suggests support for these quantized Qwen builds is still moving fast and not fully settled in stable releases.
  • The optional `mtp_tune` workflow brute-forces 1,920 fused MoE kernel configurations and writes a device-specific JSON, so this is as much a calibration tool as a Docker example.
  • Benchmark notes show the tradeoff clearly: MTP improves single-request token latency, but even two concurrent requests can sharply degrade throughput, so the setup is tuned for solo interactive use rather than shared traffic.
  • Since the repo is public, it gives other Blackwell owners a practical starting point instead of forcing them to reverse-engineer the stack.
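The sweep idea behind `mtp_tune` can be illustrated with a toy script: try every kernel block-size pair, keep the fastest, write a device-specific JSON. Everything here is a hypothetical stand-in for the repo's actual interface; `stub_bench` fakes what would really be a fused-MoE kernel timing run.

```shell
#!/bin/sh
# Toy illustration of a brute-force kernel-config sweep in the spirit of
# mtp_tune: benchmark every (block_m, block_n) pair, keep the best, emit JSON.
# stub_bench is a hypothetical stand-in for a real kernel timing run; it
# returns a fake tokens/sec score so the script is self-contained.
stub_bench() {
    echo $(( $1 * 2 + $2 ))
}

best_tps=0
best_cfg=""
for bm in 16 32 64; do
    for bn in 32 64 128; do
        tps=$(stub_bench "$bm" "$bn")
        if [ "$tps" -gt "$best_tps" ]; then
            best_tps=$tps
            best_cfg="{\"block_m\": $bm, \"block_n\": $bn}"
        fi
    done
done

# The real tool writes a device-specific JSON for the serving stack to load.
echo "$best_cfg" > best_config.json
cat best_config.json
```

The real sweep covers 1,920 configurations rather than this 3×3 toy grid, but the structure (exhaustive search, then persist the winner per GPU) is the same.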
// TAGS
vllm-docker-qwen3-5-122b-a10b-nvfp4 · llm · inference · gpu · self-hosted · benchmark

DISCOVERED

2026-03-22

PUBLISHED

2026-03-22

RELEVANCE

8 / 10

AUTHOR

1-a-n