OPEN_SOURCE
REDDIT · 12d ago · TUTORIAL
MXFP8 GEMM hits 99% cuBLAS performance
Daniel Vega-Myhre's new post walks through a Blackwell MXFP8 GEMM kernel built with CUDA + PTX, showing it can hit up to 99% of cuBLAS on favorable shapes. The write-up is a practical engineering diary, tracing the scaling rules, TMEM constraints, and optimization passes that move the kernel from 35% of cuBLAS to near parity.
// ANALYSIS
This is one of those rare benchmark posts where the engineering detail is the point. MXFP8 looks simple on paper, but the real challenge is orchestrating memory, TMEM, and synchronization so the hardware can actually realize the format's promise.
- 1x32 block scaling and e8m0 scales make MXFP8 more precise than coarse FP8 schemes, but they also impose strict layout and residency requirements
- The optimization ladder matters more than the final number: vectorized stores, larger MMA tiles, multicast, Hilbert scheduling, and store-path tweaks are what close the gap
- The stubborn 4096^3 case is a useful reminder that "up to 99%" is benchmark language, not a universal guarantee
- The accompanying code makes this a strong reference for anyone building Blackwell kernels or poking at PTX features beyond what CUDA exposes directly
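To make the first bullet concrete, here is a minimal sketch of MXFP8-style block scaling: one power-of-two scale (e8m0 carries only an exponent, no mantissa) shared by each 1x32 block, chosen so the scaled values fit the e4m3 range. The function names and the choice of `ceil` for the exponent are illustrative assumptions, not the post's actual kernel code, and the sketch skips the final rounding of each element to FP8.

```python
import math

E4M3_MAX = 448.0  # largest normal value representable in FP8 e4m3

def e8m0_scale(amax):
    # e8m0 scales carry only an exponent, so the scale must be a
    # power of two; ceil keeps amax / scale within the e4m3 range
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

def mxfp8_quantize(row):
    # one shared e8m0 scale per 1x32 block along the reduction dim
    assert len(row) % 32 == 0
    scales, quantized = [], []
    for i in range(0, len(row), 32):
        block = row[i:i + 32]
        amax = max(abs(v) for v in block)
        s = e8m0_scale(amax) if amax > 0 else 1.0
        scales.append(s)
        # a real kernel would also round each v / s to e4m3 here
        quantized.append([v / s for v in block])
    return scales, quantized

row = [0.01 * (i - 32) for i in range(64)]  # two 1x32 blocks
scales, q = mxfp8_quantize(row)
```

Because the scale is a pure power of two, dividing and re-multiplying is exact in binary floating point; all of the quantization error in real MXFP8 comes from rounding the per-element values to e4m3, which is why a 1x32 granularity loses less precision than one scale per whole tensor.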
// TAGS
mxfp8-gemm · pytorch · gpu · benchmark · research
DISCOVERED
2026-03-30
PUBLISHED
2026-03-30
RELEVANCE
8/10
AUTHOR
Benlus