OPEN_SOURCE
REDDIT // 34d ago // INFRASTRUCTURE
vLLM workaround fixes Blackwell NVFP4 MoE
A Reddit post from the LocalLLaMA community documents a practical vLLM workaround for broken NVFP4 MoE kernels on RTX 5090 and RTX PRO 6000 Blackwell GPUs: force `--moe-backend marlin` and run Qwen3.5-397B with TP=2 plus PP=2. The result is correct output and roughly 46-49 tok/s on a 4-GPU PCIe setup, versus garbage output or much lower throughput with the default CUTLASS-based paths.
// ANALYSIS
This is exactly the kind of field report infra teams care about: not a flashy launch, but a concrete escape hatch when bleeding-edge quantized MoE serving breaks on new hardware.
- The core insight is that Marlin’s W4A16 path sidesteps broken SM120 FP4 GEMM kernels by dequantizing weights to FP16 before compute
- The post argues communication topology matters as much as kernels here: TP=2 plus PP=2 beats TP=4 on PCIe by sharply reducing all-reduce overhead
- It is especially useful for anyone trying to serve huge Qwen3.5 NVFP4 MoE models on workstation-class Blackwell boxes without NVLink
- This also highlights a recurring inference-stack reality: new GPU support often lands in layers, with device checks passing before low-level kernels are actually stable
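The setup described above can be sketched as a single `vllm serve` invocation. This is an illustrative reconstruction, not a command from the post: the exact model repo name is a placeholder assumption, `--tensor-parallel-size` and `--pipeline-parallel-size` are standard vLLM options, and `--moe-backend marlin` is the override the post recommends for broken NVFP4 kernels.

```shell
# Hypothetical sketch of the reported workaround on 4x PCIe Blackwell GPUs.
# Model name is a placeholder; substitute the actual NVFP4 checkpoint.
# --moe-backend marlin: per the post, routes MoE GEMMs through Marlin's
#   W4A16 path (dequantize to FP16), avoiding broken SM120 FP4 kernels.
# TP=2 x PP=2 = 4 GPUs total; the post reports this beats TP=4 on PCIe
#   because pipeline stages cut cross-GPU all-reduce traffic.
vllm serve Qwen/Qwen3.5-397B-NVFP4 \
    --moe-backend marlin \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2
```

Per the post, this configuration yields correct output at roughly 46-49 tok/s, versus garbage output or much lower throughput on the default CUTLASS-based paths.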
// TAGS
vllm · llm · inference · gpu · open-source
DISCOVERED
34d ago
2026-03-08
PUBLISHED
34d ago
2026-03-08
RELEVANCE
8/10
AUTHOR
lawdawgattorney