vLLM hits Qwen3.6 MTP pipeline wall
A LocalLLaMA user reports that Qwen3.6 27B with vLLM’s MTP speculative decoding fails as soon as pipeline parallelism is enabled, throwing a `SupportsPP`/pipeline-parallelism error. The post points to an upstream vLLM bug and confirms the practical workaround is to drop either MTP or pipeline parallelism. That makes the setup usable, but it gives up either the speculative-decoding speedup or the pipeline-parallel multi-GPU layout the user wanted.
The sharp edge here is not the model alone, but the interaction between vLLM’s speculative decoding path and PP support for this serving stack.
- This looks like a genuine compatibility gap, not a misconfigured container, because the failure matches vLLM’s own PP support checks and the linked upstream bug.
- The user-facing pain is real: PP is often needed to fit a 27B model across GPUs, while MTP is the lever that would make generation faster.
- If you need the model to run today, the choice is basically “fit it with PP” or “speed it up with MTP,” but not both together in this configuration.
- The post is most useful to operators evaluating vLLM for multi-GPU Qwen deployments, especially when comparing it with llama.cpp setups that may already fit their workflow.
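The either/or choice in the bullets above can be sketched as a small launcher helper. This is an illustration only, not the poster's actual command: the model ID is a placeholder, the helper name `vllm_serve_args` is hypothetical, and the `"method"` value inside the speculative config is an assumption that may differ for this model and vLLM version. The only vLLM-specific pieces are the `--pipeline-parallel-size` and `--speculative-config` flags of `vllm serve`.

```python
import json

# Placeholder; the post does not give the exact checkpoint name.
MODEL = "<qwen3.6-27b-model-id>"


def vllm_serve_args(model, pipeline_parallel_size=1, mtp_tokens=0):
    """Build a `vllm serve` argv list for one of the two fallback setups.

    Refuses the combination the post reports as failing (PP + MTP).
    """
    if pipeline_parallel_size > 1 and mtp_tokens > 0:
        raise ValueError(
            "MTP speculative decoding with pipeline parallelism is "
            "reported to fail; pick one"
        )
    args = ["vllm", "serve", model]
    if pipeline_parallel_size > 1:
        args += ["--pipeline-parallel-size", str(pipeline_parallel_size)]
    if mtp_tokens > 0:
        # Assumed config shape; the "method" value is a guess.
        spec = {"method": "mtp", "num_speculative_tokens": mtp_tokens}
        args += ["--speculative-config", json.dumps(spec)]
    return args


# Option A: fit the model across GPUs with PP, no MTP.
fit_with_pp = vllm_serve_args(MODEL, pipeline_parallel_size=2)

# Option B: keep MTP for speed, no PP.
fast_with_mtp = vllm_serve_args(MODEL, mtp_tokens=2)

print(" ".join(fit_with_pp))
print(" ".join(fast_with_mtp))
```

Encoding the restriction in the launcher turns the upstream gap into an early, readable error rather than a failure at model load time. The guard can be dropped once the upstream bug is resolved.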
DISCOVERED
2026-04-27
PUBLISHED
2026-04-26
AUTHOR
fragment_me