cuBLAS bug hits RTX 5090 performance
NVIDIA's cuBLAS library contains a critical dispatcher bug that slashes RTX 5090 performance by 60% in batched FP32 workloads. The software fails to escalate to optimized kernels on Blackwell consumer hardware, trapping the high-end GPU in inefficient legacy execution paths while professional counterparts receive proper heuristics.
The regression amounts to a "silent tax" on consumer hardware. The cuBLAS dispatcher for sm_120 fails to select larger tile sizes on RTX GPUs, defaulting to tiny 128x32 kernels that leave the chip underutilized. A custom kernel of roughly 300 lines that leverages the Tensor Memory Accelerator (TMA) beats cuBLAS by up to 70%. The failure mirrors Pascal-era bugs in which consumer cards were routed through Maxwell kernels, suggesting persistent legacy debt in NVIDIA's dispatch logic. While the tensor-core FP16 paths remain fast, the FP32 SGEMM regression hits scientific computing and non-tensor ML workloads hard.
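The batched FP32 path described above can be checked with a small timing harness around cuBLAS's strided-batched SGEMM. This is a minimal sketch, not the article's benchmark: the matrix size (1024x1024), batch count (64), and the derived TFLOP/s figure are illustrative assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    // Hypothetical workload: a batch of 64 square FP32 GEMMs, n = 1024.
    const int n = 1024, batch = 64;
    const long long stride = (long long)n * n;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * stride * batch);
    cudaMalloc(&B, sizeof(float) * stride * batch);
    cudaMalloc(&C, sizeof(float) * stride * batch);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so the timed run measures the steady-state kernel choice.
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n, &alpha,
                              A, n, stride, B, n, stride,
                              &beta, C, n, stride, batch);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n, &alpha,
                              A, n, stride, B, n, stride,
                              &beta, C, n, stride, batch);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each GEMM performs 2*n^3 FLOPs; scale by the batch count.
    double tflops = 2.0 * n * n * n * (double)batch / (ms * 1e-3) / 1e12;
    printf("batched FP32 SGEMM: %.2f ms, %.2f TFLOP/s\n", ms, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Profiling the same binary (for example with Nsight Systems or Nsight Compute) reveals which SGEMM tile variant cuBLAS actually dispatched; on sm_120 consumer parts the article reports a 128x32 tile where a larger tile was expected.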
DISCOVERED: 2026-04-10
PUBLISHED: 2026-04-10
AUTHOR: NoVibeCoding