OPEN_SOURCE
REDDIT · 29d ago · BENCHMARK RESULT
NVIDIA CUTLASS kernels broken on SM120, cap Qwen3.5-397B at 50 tok/s
A developer benchmarked 16 inference configurations of Qwen3.5-397B-A17B-NVFP4 across 4x RTX PRO 6000 (SM120) workstation GPUs, finding a hard ceiling at 50.5 tok/s — roughly half the theoretical maximum — caused by a confirmed NVIDIA CUTLASS bug that makes all 80 FP4 tensor core kernel tactics fail at initialization on SM120 hardware.
// ANALYSIS
NVIDIA is selling $20K+ AI workstation GPUs bundled with NVFP4-quantized models, but the inference kernels designed to use them are silently broken on their own silicon — that's a product integrity failure, not a user error.
- Root cause: all 80 TMA Warp Specialized grouped GEMM tactics fail at initialization on SM120 (RTX PRO 6000 / RTX 5090), forcing fallback to Marlin W4A16 dequantization and leaving the FP4 tensor cores idle
- SM121 (DGX Spark) runs native FP4 MoE at 356 TFLOPS without issue; NVIDIA has simply not validated SM120 tile configurations despite shipping nearly identical consumer and workstation silicon
- Multi-Token Prediction (MTP) causes a 22% regression on SM120: Marlin's dequantization produces activation values that diverge from what the MTP draft heads expect, dropping acceptance rates to 61-85%
- Community claims of 130-150 tok/s on identical hardware appear to count speculative tokens (proposed plus rejected) rather than delivered output, a benchmarking-methodology error that is spreading misinformation
- The practical workaround (`VLLM_MOE_FORCE_MARLIN=1`, TP=4, MTP and expert parallelism disabled) sustains 50.5 tok/s, still faster than most Llama 70B setups but well short of what the FP4 tensor cores should provide
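The speculative-vs-delivered distinction in the methodology point above can be shown with a toy calculation. This is a hedged sketch, not vLLM's actual token accounting (which also pays verification overhead, so real delivered rates land lower): if a reported rate counts every drafted token, delivered output is at most the reported rate scaled by the acceptance rate.

```python
# Toy sketch: why counting proposed-plus-rejected speculative tokens
# inflates a throughput figure. NOT vLLM's real accounting, just
# first-order arithmetic using the acceptance rates observed above.

def delivered_tok_s(reported_tok_s: float, acceptance_rate: float) -> float:
    """Upper bound on delivered throughput when `reported_tok_s`
    counts every drafted token and only `acceptance_rate` of them
    survive verification."""
    return reported_tok_s * acceptance_rate

# Community-reported 130-150 tok/s at the observed 61-85% acceptance:
low = delivered_tok_s(130, 0.61)   # ~79 tok/s at best
high = delivered_tok_s(150, 0.85)  # ~128 tok/s at best
print(f"{low:.1f}-{high:.1f} tok/s delivered, at most")
```

Even this generous upper bound falls short of the headline claims, consistent with the 50.5 tok/s the benchmark measured as actual delivered output.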
// TAGS
llm · inference · benchmark · gpu · open-weights · vllm
DISCOVERED
2026-03-14
PUBLISHED
2026-03-12
RELEVANCE
7/10
AUTHOR
lawdawgattorney