NVIDIA CUTLASS kernels broken on SM120, cap Qwen3.5-397B at 50 tok/s
OPEN_SOURCE
REDDIT · 29d ago · BENCHMARK RESULT


A developer benchmarked 16 inference configurations of Qwen3.5-397B-A17B-NVFP4 across 4x RTX PRO 6000 (SM120) workstation GPUs, finding a hard ceiling at 50.5 tok/s — roughly half the theoretical maximum — caused by a confirmed NVIDIA CUTLASS bug that makes all 80 FP4 tensor core kernel tactics fail at initialization on SM120 hardware.

// ANALYSIS

NVIDIA is selling $20K+ AI workstation GPUs bundled with NVFP4-quantized models, but the inference kernels designed to use them are silently broken on their own silicon — that's a product integrity failure, not a user error.

  • Root cause: all 80 TMA Warp Specialized grouped GEMM tactics fail on SM120 (RTX PRO 6000 / RTX 5090), forcing fallback to Marlin W4A16 dequantization and leaving FP4 tensor cores idle
  • SM121 (DGX Spark) runs native FP4 MoE at 356 TFLOPS with no issue — NVIDIA has simply not validated SM120 tile configs despite shipping virtually identical consumer/workstation hardware
  • Multi-Token Prediction regresses throughput by 22% on SM120 because Marlin's dequantization produces activation values that diverge from what the MTP draft heads expect, dropping acceptance rates to 61-85%
  • Community claims of 130-150 tok/s on identical hardware appear to measure speculative tokens (proposed + rejected) rather than delivered output — a benchmarking methodology issue that is spreading misinformation
  • The practical workaround (`VLLM_MOE_FORCE_MARLIN=1`, TP=4, no MTP, no expert parallel) delivers 50.5 tok/s sustained — still faster than most Llama 70B setups but well short of what FP4 tensor cores should provide
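The inflated community numbers in the fourth bullet come down to simple accounting: counting every token the draft heads propose, rather than only the tokens the verifier accepts and delivers, overstates throughput. A minimal sketch with illustrative numbers (the step count and timing are made up; the 61% acceptance rate is the low end of the range reported above):

```python
def throughput(tokens: float, seconds: float) -> float:
    """Tokens per second."""
    return tokens / seconds

# Illustrative run: each decode step emits 1 verified token, and the MTP
# draft heads propose 3 extra tokens per step. At 61% acceptance, most
# drafts are rejected and must not be counted as delivered output.
steps = 1000
drafts_per_step = 3
acceptance = 0.61
seconds = 60.0

proposed = steps * (1 + drafts_per_step)                 # naive count: all proposals
delivered = steps * (1 + drafts_per_step * acceptance)   # only accepted drafts count

inflated = throughput(proposed, seconds)   # what a naive benchmark reports
actual = throughput(delivered, seconds)    # what the user actually receives
```

With these numbers the naive counter reports roughly 67 tok/s while the delivered rate is about 47 tok/s, the same shape of gap as the 130-150 tok/s claims versus the 50.5 tok/s measurement.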
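The workaround in the last bullet might be launched roughly as follows. This is a sketch, not a verified recipe: the environment variable comes from the report, the model path is a placeholder, and MTP/expert parallelism are disabled simply by not enabling them:

```shell
# Force the Marlin W4A16 fallback path instead of the broken FP4 CUTLASS tactics
export VLLM_MOE_FORCE_MARLIN=1

# TP=4 across the four RTX PRO 6000 cards. Speculative decoding (MTP) and
# expert parallelism are left off. Model path is a placeholder.
vllm serve /models/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 4
```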
// TAGS
llm · inference · benchmark · gpu · open-weights · vllm

DISCOVERED

2026-03-14 (29d ago)

PUBLISHED

2026-03-12 (31d ago)

RELEVANCE

7/10

AUTHOR

lawdawgattorney