MXFP8 GEMM hits 99% cuBLAS performance
REDDIT // 12d ago · TUTORIAL


Daniel Vega-Myhre's new post walks through a Blackwell MXFP8 GEMM kernel built with CUDA + PTX, showing it can hit up to 99% of cuBLAS on favorable shapes. The write-up is a practical engineering diary, tracing the scaling rules, TMEM constraints, and optimization passes that move the kernel from 35% of cuBLAS to near parity.

// ANALYSIS

This is one of those rare benchmark posts where the engineering detail is the point. MXFP8 looks simple on paper, but the real challenge is orchestrating memory, TMEM, and synchronization so the hardware can actually realize the format's promise.

  • 1x32 block scaling and e8m0 scales make MXFP8 more precise than coarse FP8 schemes, but they also impose strict layout and residency requirements
  • The optimization ladder matters more than the final number: vectorized stores, larger MMA tiles, multicast, Hilbert scheduling, and store-path tweaks are what close the gap
  • The stubborn 4096×4096×4096 case is a useful reminder that "up to 99%" is benchmark language, not a universal guarantee
  • The accompanying code makes this a strong reference for anyone building Blackwell kernels or poking at PTX features beyond what CUDA exposes directly
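The 1x32 block scaling mentioned in the first bullet can be sketched in a few lines. This is a minimal Python sketch of the OCP Microscaling (MX) convention the format is based on — one power-of-two e8m0 scale per 32 E4M3 elements — not the post's CUDA implementation; `mx_block_scale` and `quantize_block` are hypothetical helper names, and FP8 rounding of the scaled values is not modeled.

```python
import math

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude
E4M3_EMAX = 8         # exponent of the largest E4M3 binade (448 = 1.75 * 2^8)
BLOCK = 32            # MX block size: one e8m0 scale per 32 elements

def mx_block_scale(block):
    """Power-of-two scale for one block, per the MX rule:
    shared_exp = floor(log2(amax)) - emax_elem.
    The e8m0 byte would store shared_exp + 127 (no sign, no mantissa)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    shared_exp = math.floor(math.log2(amax)) - E4M3_EMAX
    return 2.0 ** shared_exp

def quantize_block(block):
    """Divide by the block scale; values that still exceed the E4M3
    range (possible by up to 512/448) saturate, as the MX spec allows."""
    s = mx_block_scale(block)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / s)) for x in block]
    return s, scaled
```

Because the scale is a pure power of two, dequantization is exact up to FP8 rounding, which is what lets MXFP8 track finer-grained dynamic range than a single per-tensor FP8 scale.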
// TAGS
mxfp8-gemm · pytorch · gpu · benchmark · research

DISCOVERED

12d ago

2026-03-30

PUBLISHED

13d ago

2026-03-30

RELEVANCE

8 / 10

AUTHOR

Benlus