OPEN_SOURCE
REDDIT · 19d ago · OPEN-SOURCE RELEASE
FlashAttention-4 hits 1,613 TFLOPS on NVIDIA Blackwell
FlashAttention-4 is a major performance release optimized for NVIDIA Hopper and Blackwell architectures, achieving record-breaking throughput by leveraging hardware-specific features such as TMEM and async TMA. Written entirely in Python via NVIDIA's CuTe-DSL, it effectively eliminates the "softmax bottleneck" on Blackwell GPUs, bringing attention throughput close to the theoretical matmul limit.
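The "softmax bottleneck" refers to the exponential and rescaling work that competes with the tensor cores' matmul throughput. The actual FA4 kernel is a tiled CuTe-DSL implementation; as a rough illustration only, here is a minimal NumPy sketch of the online-softmax pattern FlashAttention-style kernels are built on, including the selective-rescaling idea (only rescale the accumulator when the running max actually changes):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Streaming attention for a single query vector: process K/V in
    blocks, keeping a running max (m) and normalizer (l) so the full
    score row is never materialized. Illustrative sketch, not FA4."""
    d = q.shape[0]
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[1])       # unnormalized output accumulator
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # score block
        m_new = max(m, s.max())
        # Selective rescaling: skip the correction entirely when the
        # running max is unchanged by this block.
        if m_new != m:
            scale = np.exp(m - m_new)   # exp(-inf) == 0.0 on first block
            acc *= scale
            l *= scale
            m = m_new
        p = np.exp(s - m)               # unnormalized block probabilities
        l += p.sum()
        acc += p @ V[start:start + block]
    return acc / l
```

The blockwise loop is what lets the real kernel keep everything in on-chip memory; the `if m_new != m` branch is the shape of the optimization, traded here for clarity over the warp-level form the kernel uses.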
// ANALYSIS
FlashAttention-4 represents a paradigm shift in kernel development by moving from verbose CUDA C++ to a high-performance Python-based DSL without sacrificing runtime efficiency.
- Reaches 71% hardware utilization on B200 GPUs, outperforming Triton by up to 2.7x and cuDNN 9.13 by 1.3x.
- Introduces selective rescaling and software-emulated exponentials to solve architecture-specific bottlenecks on next-gen hardware.
- Massive developer-velocity gains: CuTe-DSL reduces compilation time from 55 seconds to just 2.5 seconds, enabling rapid iteration.
- Strict hardware requirements (Hopper/Blackwell only) underscore the growing divergence between elite enterprise hardware and consumer/legacy GPUs.
- Seamless integration with vLLM 0.17.0 and PyTorch FlexAttention provides immediate performance gains for Blackwell-based inference clusters.
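The 71% utilization figure and the headline throughput are mutually consistent. A quick sanity check, assuming a B200 dense BF16 peak of roughly 2,250 TFLOPS (a commonly cited approximation, not a number from the post):

```python
# Reported FA4 throughput vs. an assumed B200 dense BF16 peak.
reported_tflops = 1613
assumed_b200_peak_tflops = 2250   # assumption: approximate spec figure
utilization = reported_tflops / assumed_b200_peak_tflops
print(f"{utilization:.0%}")       # prints 72%, matching the ~71% claim within rounding
```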
// TAGS
flashattention-4 · gpu · inference · open-source · llm · benchmark · research
DISCOVERED
2026-03-24
PUBLISHED
2026-03-24
RELEVANCE
10/10
AUTHOR
Sensitive-Two9732