FlashAttention-4 hits 1,613 TFLOPs/s on NVIDIA Blackwell
REDDIT · 19d ago · OPEN SOURCE RELEASE


FlashAttention-4 is a major performance release optimized for NVIDIA's Hopper and Blackwell architectures, achieving record throughput by exploiting hardware-specific features such as Tensor Memory (TMEM) and asynchronous Tensor Memory Accelerator (TMA) copies. Written entirely in Python via NVIDIA's CuTe-DSL, it largely eliminates the "softmax bottleneck" on Blackwell GPUs, pushing attention throughput close to the theoretical matmul limit.
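For readers unfamiliar with why a fused attention kernel can approach matmul limits at all: the key idea (in FlashAttention generally, not specific to this release) is the online-softmax tiling, which streams key/value tiles through fast memory and never materializes the full score matrix. A minimal NumPy sketch of that algorithm, written from the published FlashAttention recurrence rather than the actual FA4 kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Online-softmax attention over key/value tiles.

    Numerically equivalent to softmax(Q K^T / sqrt(d)) V, but the n x n
    score matrix is never materialized -- only one tile of scores exists
    at a time, which is the core trick behind FlashAttention kernels.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))              # output accumulator
    m = np.full(n, -np.inf)           # running row maxima
    l = np.zeros(n)                   # running softmax denominators
    for j0 in range(0, K.shape[0], tile):
        S = (Q @ K[j0:j0 + tile].T) * scale    # scores for this KV tile
        m_new = np.maximum(m, S.max(axis=1))   # updated row maxima
        alpha = np.exp(m - m_new)              # rescale factor for old state
        P = np.exp(S - m_new[:, None])         # unnormalized tile probabilities
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ V[j0:j0 + tile]
        m = m_new
    return O / l[:, None]
```

On the GPU, the two matmuls (`Q @ K.T` and `P @ V`) run on tensor cores while the exponentials and rescaling run elsewhere; the "softmax bottleneck" the summary refers to is precisely this non-matmul work failing to keep pace with Blackwell's tensor cores.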

// ANALYSIS

FlashAttention-4 marks a shift in kernel development: from verbose CUDA C++ to a high-performance Python-based DSL, without sacrificing runtime efficiency.

  • Reaches 71% hardware utilization on B200 GPUs, outperforming Triton by up to 2.7x and cuDNN 9.13 by 1.3x.
  • Introduces selective rescaling and software-emulated exponentials to solve architecture-specific bottlenecks on next-gen hardware.
  • Massive developer velocity gains: CuTe-DSL reduces compilation time from 55 seconds to just 2.5 seconds, enabling rapid iteration.
  • Strict hardware requirements (Hopper/Blackwell only) underscore the growing divergence between elite enterprise hardware and consumer/legacy GPUs.
  • Seamless integration with vLLM 0.17.0 and PyTorch FlexAttention provides immediate performance gains for Blackwell-based inference clusters.
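The "selective rescaling" named above can be illustrated by extending the online-softmax recurrence: the accumulator rescale by `alpha = exp(m_old - m_new)` is a no-op whenever no row maximum changed, so it can be skipped. This is a loose, illustrative NumPy model of the idea under that assumption; the real kernel makes this decision at warp granularity in hardware, not per block as sketched here:

```python
import numpy as np

def tiled_attention_selective(Q, K, V, tile=64):
    """Online-softmax attention that skips the accumulator rescale when no
    row maximum moved on the current tile (i.e. alpha == 1 for every row).

    Returns the attention output and how many tiles actually rescaled --
    an illustrative model of FA4-style selective rescaling, not the kernel.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    rescales = 0
    for j0 in range(0, K.shape[0], tile):
        S = (Q @ K[j0:j0 + tile].T) * scale
        m_new = np.maximum(m, S.max(axis=1))
        if not np.array_equal(m_new, m):
            # Selective step: only pay for the rescale when a maximum moved.
            alpha = np.exp(m - m_new)
            l *= alpha
            O *= alpha[:, None]
            rescales += 1
        P = np.exp(S - m_new[:, None])
        l += P.sum(axis=1)
        O += P @ V[j0:j0 + tile]
        m = m_new
    return O / l[:, None], rescales
```

Skipping the rescale removes multiplies from the critical non-matmul path, which matters on Blackwell exactly because the tensor cores would otherwise sit idle waiting on that arithmetic.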
// TAGS
flashattention-4 · gpu · inference · open-source · llm · benchmark · research

DISCOVERED

2026-03-24 (19d ago)

PUBLISHED

2026-03-24 (19d ago)

RELEVANCE

10/10

AUTHOR

Sensitive-Two9732