FlashInfer patches unlock native FP4 on Blackwell
OPEN_SOURCE · REDDIT · INFRASTRUCTURE · 34d ago


A detailed Reddit debug report shows that patched FlashInfer kernels plus CUDA 13.0’s `compute_120f` target finally produce correct native NVFP4 Mixture-of-Experts output on RTX PRO 6000 Blackwell GPUs, reaching about 39 tok/s. The finding matters because the default vLLM/CUTLASS grouped GEMM path on SM120 was returning garbage generations, turning this from a tuning problem into a real correctness bug in the inference stack.
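The bug is specific to SM120, the compute capability of desktop and workstation Blackwell parts like the RTX PRO 6000. Before blaming kernels, it is worth confirming what the GPU actually reports; a minimal check, assuming a driver recent enough to expose the `compute_cap` query field:

```shell
# Report the GPU name and compute capability.
# An RTX PRO 6000 Blackwell should show 12.0 (SM120).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```

If this prints a different capability, the FP4 path discussed here does not apply.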

// ANALYSIS

This is exactly the kind of low-level inference bug that can make a brand-new GPU generation look broken until somebody digs through the kernel stack by hand.

  • The biggest takeaway is correctness, not speed: the default grouped GEMM path was emitting incoherent tokens on SM120 desktop Blackwell, so “native FP4 support” was not actually usable for MoE workloads.
  • The working path runs through FlashInfer CUTLASS with SM120 capability patches and CUDA 13.0’s `compute_120f`, which strongly suggests architecture targeting and TMA kernel enablement were the real bottlenecks.
  • 39 tok/s still trails Marlin’s 46-49 tok/s, but it is close enough to prove native NVFP4 on desktop Blackwell is viable once upstream kernels stop falling back to slow tactics.
  • For AI infra teams deploying Blackwell workstations, this is a useful warning that feature-matrix support and real-world kernel behavior can diverge badly on new architectures.
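The lever identified above is the architecture target passed to the compiler. As a hedged sketch only: the source file name and auxiliary flags below are illustrative assumptions, not taken from the report, but the `-gencode` form for CUDA's family-specific targets follows NVIDIA's documented pattern:

```shell
# Hypothetical build fragment: target the family-specific SM120
# architecture with CUDA 13.0's nvcc, as the patched FlashInfer
# path requires. The .cu file name here is a placeholder.
nvcc -gencode arch=compute_120f,code=sm_120f \
     -O3 --expt-relaxed-constexpr \
     -c grouped_gemm_nvfp4.cu -o grouped_gemm_nvfp4.o
```

Building for plain `sm_120` instead of the `f` (family) variant would plausibly explain the fallback to slow or incorrect tactics the report describes, since family-specific targets are what enable the newer TMA kernel features.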
// TAGS
flashinfer · vllm · inference · gpu · open-source

DISCOVERED

34d ago

2026-03-09

PUBLISHED

34d ago

2026-03-09

RELEVANCE

8 / 10

AUTHOR

lawdawgattorney