NVIDIA CUTLASS kernels broken on SM120, cap Qwen3.5-397B at 50 tok/s
OPEN_SOURCE
REDDIT · 29d ago · BENCHMARK RESULT


A developer benchmarked 16 inference configurations of Qwen3.5-397B-A17B-NVFP4 across 4x RTX PRO 6000 (SM120) workstation GPUs, finding a hard ceiling at 50.5 tok/s — roughly half the theoretical maximum — caused by a confirmed NVIDIA CUTLASS bug that makes all 80 FP4 tensor core kernel tactics fail at initialization on SM120 hardware.

// ANALYSIS

NVIDIA is selling $20K+ AI workstation GPUs bundled with NVFP4-quantized models, but the inference kernels designed to use them are silently broken on their own silicon — that's a product integrity failure, not a user error.

  • Root cause: all 80 TMA Warp Specialized grouped GEMM tactics fail on SM120 (RTX PRO 6000 / RTX 5090), forcing fallback to Marlin W4A16 dequantization and leaving FP4 tensor cores idle
  • SM121 (DGX Spark) runs native FP4 MoE at 356 TFLOPS with no issue — NVIDIA has simply not validated SM120 tile configs despite shipping virtually identical consumer/workstation hardware
  • Multi-Token Prediction regresses throughput by 22% on SM120 because Marlin's dequantization produces activation values that diverge from what the MTP draft heads expect, dropping acceptance rates to 61-85%
  • Community claims of 130-150 tok/s on identical hardware appear to measure speculative tokens (proposed + rejected) rather than delivered output — a benchmarking methodology issue that is spreading misinformation
  • The practical workaround (`VLLM_MOE_FORCE_MARLIN=1`, TP=4, no MTP, no expert parallel) delivers 50.5 tok/s sustained — still faster than most Llama 70B setups but well short of what FP4 tensor cores should provide
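The inflated community numbers in the fourth bullet come down to simple accounting: counting every token the draft heads propose, rather than only the tokens the verifier accepts and delivers, overstates throughput. A minimal sketch with illustrative numbers (the step count and timing are made up; the 61% acceptance rate is the low end of the range reported above):

```python
def throughput(tokens: float, seconds: float) -> float:
    """Tokens per second."""
    return tokens / seconds

# Illustrative run: each decode step emits 1 verified token, and the MTP
# draft heads propose 3 extra tokens per step. At 61% acceptance, most
# drafts are rejected and must not be counted as delivered output.
steps = 1000
drafts_per_step = 3
acceptance = 0.61
seconds = 60.0

proposed = steps * (1 + drafts_per_step)                 # naive count: all proposals
delivered = steps * (1 + drafts_per_step * acceptance)   # only accepted drafts count

inflated = throughput(proposed, seconds)   # what a naive benchmark reports
actual = throughput(delivered, seconds)    # what the user actually receives
```

With these numbers the naive counter reports roughly 67 tok/s while the delivered rate is about 47 tok/s, the same shape of gap as the 130-150 tok/s claims versus the 50.5 tok/s measurement.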
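The workaround in the last bullet might be launched roughly as follows. This is a sketch, not a verified recipe: the environment variable comes from the report, the model path is a placeholder, and MTP/expert parallelism are disabled simply by not enabling them:

```shell
# Force the Marlin W4A16 fallback path instead of the broken FP4 CUTLASS tactics
export VLLM_MOE_FORCE_MARLIN=1

# TP=4 across the four RTX PRO 6000 cards. Speculative decoding (MTP) and
# expert parallelism are left off. Model path is a placeholder.
vllm serve /models/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 4
```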
// TAGS
llm · inference · benchmark · gpu · open-weights · vllm

DISCOVERED

2026-03-14 (29d ago)

PUBLISHED

2026-03-12 (31d ago)

RELEVANCE

7/10

AUTHOR

lawdawgattorney