Qwen3.5 vLLM Docker config lands on 6000 Pro
Ian Hailey's GitHub repo packages a Docker/vLLM stack for serving Sehyo/Qwen3.5-122B-A10B-NVFP4 on a single RTX PRO 6000 Blackwell with a 262k-token context window. It adds Multi-Token Prediction (MTP) speculative decoding and a GPU-specific tuning script, turning the setup into a reproducible inference recipe rather than a one-off config dump.
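A launch command along these lines conveys the shape of the stack. This is a hedged sketch, not the repo's actual compose file: the flag names follow vLLM's documented serve interface, but the exact values and especially the MTP `--speculative-config` payload are assumptions to check against the repo.

```shell
# Sketch: serve the NVFP4 build via the nightly vLLM OpenAI image.
# Flag values and the speculative-config JSON are illustrative assumptions,
# not copied from Ian Hailey's repo -- consult the repo for the real config.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:nightly \
  --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```

The 262144 value is the 262k-token context window from the summary; the repo additionally layers its tuning script and Transformers reinstall on top of this base image.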
This is a useful field report, not a splashy launch: it shows how much work it still takes to get a huge NVFP4 MoE to behave on Blackwell, and why nightly software plus hardware-specific tuning are still part of the game. The payoff is that the repo bundles the messy bits into something other Blackwell owners can actually reproduce.
- The repo proves that a 122B MoE with a 262k context window can be made practical on a single workstation-class Blackwell GPU, which is a meaningful milestone for local serving.
- `vllm/vllm-openai:nightly` plus a forced Transformers reinstall suggests support for these quantized Qwen builds is still moving fast and not fully settled in stable releases.
- The optional `mtp_tune` workflow brute-forces 1,920 fused MoE kernel configurations and writes a device-specific JSON, so this is as much a calibration tool as a Docker example.
- Benchmark notes show the tradeoff clearly: MTP improves single-request token latency, but two-request concurrency can hammer throughput, so this is tuned for solo interactive use rather than shared traffic.
- Since the repo is public, it gives other Blackwell owners a practical starting point instead of forcing them to reverse-engineer the stack.
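The `mtp_tune` idea, sweep candidate kernel configurations, score each one, and persist the per-device winner as JSON, can be sketched generically. Everything below is invented for illustration (config fields, cost model, filename); the real script times 1,920 actual fused-MoE kernel configurations on the GPU rather than evaluating a stand-in formula.

```shell
# Hypothetical sketch of a brute-force kernel-config sweep. The repo's
# mtp_tune benchmarks real fused-MoE kernels; here a deterministic
# stand-in cost model replaces the GPU timing step.
dev="rtx_pro_6000"
best_cost=999999
best_cfg=""
for bm in 16 32 64 128; do
  for bn in 64 128 256; do
    for warps in 4 8; do
      # Stand-in cost (a real tuner would launch the kernel with this
      # config and measure latency instead of computing a formula).
      cost=$(( (bm > 64 ? bm - 64 : 64 - bm) + (bn > 128 ? bn - 128 : 128 - bn) + 16 / warps ))
      if [ "$cost" -lt "$best_cost" ]; then
        best_cost=$cost
        best_cfg="$bm $bn $warps"
      fi
    done
  done
done
# Persist a device-specific JSON, mirroring what the repo's tuning step does.
set -- $best_cfg
printf '{"device":"%s","block_m":%s,"block_n":%s,"num_warps":%s}\n' \
  "$dev" "$1" "$2" "$3" > "moe_config_${dev}.json"
echo "best config: $best_cfg (cost $best_cost)"
```

The point of the pattern is the output artifact: because the winning configuration depends on the specific GPU, the JSON is keyed by device name and reused at serve time instead of re-tuning on every launch.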
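The latency-versus-concurrency tradeoff in the benchmark notes can be probed crudely against a running server by timing one request versus two concurrent ones. The endpoint URL and payload below are assumptions (a local vLLM OpenAI-compatible server); a real comparison should use a proper load-testing tool and measure per-token latency, not wall-clock time.

```shell
# Assumed local OpenAI-compatible endpoint; payload is illustrative only.
REQ='{"model": "Sehyo/Qwen3.5-122B-A10B-NVFP4", "prompt": "Hello", "max_tokens": 128}'

# Single request: the interactive case MTP is tuned for here.
time curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' -d "$REQ" > /dev/null

# Two concurrent requests: the case where the notes report throughput dropping.
time ( for i in 1 2; do
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' -d "$REQ" > /dev/null &
done; wait )
```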
DISCOVERED: 2026-03-22
PUBLISHED: 2026-03-22
AUTHOR: 1-a-n