FlashInfer patches unlock native FP4 on Blackwell
OPEN_SOURCE · REDDIT · INFRASTRUCTURE · 34d ago


A detailed Reddit debug report shows that patched FlashInfer kernels plus CUDA 13.0’s `compute_120f` target finally produce correct native NVFP4 Mixture-of-Experts output on RTX PRO 6000 Blackwell GPUs, reaching about 39 tok/s. The finding matters because the default vLLM/CUTLASS grouped GEMM path on SM120 was returning garbage generations, turning this from a tuning problem into a real correctness bug in the inference stack.
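The bug is specific to SM120, the compute capability of desktop and workstation Blackwell parts like the RTX PRO 6000. Before blaming kernels, it is worth confirming what the GPU actually reports; a minimal check, assuming a driver recent enough to expose the `compute_cap` query field:

```shell
# Report the GPU name and compute capability.
# An RTX PRO 6000 Blackwell should show 12.0 (SM120).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```

If this prints a different capability, the FP4 path discussed here does not apply.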

// ANALYSIS

This is exactly the kind of low-level inference bug that can make a brand-new GPU generation look broken until somebody digs through the kernel stack by hand.

  • The biggest takeaway is correctness, not speed: the default grouped GEMM path was emitting incoherent tokens on SM120 desktop Blackwell, so “native FP4 support” was not actually usable for MoE workloads.
  • The working path runs through FlashInfer CUTLASS with SM120 capability patches and CUDA 13.0’s `compute_120f`, which strongly suggests architecture targeting and TMA kernel enablement were the real bottlenecks.
  • 39 tok/s still trails Marlin’s 46-49 tok/s, but it is close enough to prove native NVFP4 on desktop Blackwell is viable once upstream kernels stop falling back to slow tactics.
  • For AI infra teams deploying Blackwell workstations, this is a useful warning that feature-matrix support and real-world kernel behavior can diverge badly on new architectures.
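The lever identified above is the architecture target passed to the compiler. As a hedged sketch only: the source file name and auxiliary flags below are illustrative assumptions, not taken from the report, but the `-gencode` form for CUDA's family-specific targets follows NVIDIA's documented pattern:

```shell
# Hypothetical build fragment: target the family-specific SM120
# architecture with CUDA 13.0's nvcc, as the patched FlashInfer
# path requires. The .cu file name here is a placeholder.
nvcc -gencode arch=compute_120f,code=sm_120f \
     -O3 --expt-relaxed-constexpr \
     -c grouped_gemm_nvfp4.cu -o grouped_gemm_nvfp4.o
```

Building for plain `sm_120` instead of the `f` (family) variant would plausibly explain the fallback to slow or incorrect tactics the report describes, since family-specific targets are what enable the newer TMA kernel features.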
// TAGS
flashinfer · vllm · inference · gpu · open-source

DISCOVERED

34d ago

2026-03-09

PUBLISHED

34d ago

2026-03-09

RELEVANCE

8 / 10

AUTHOR

lawdawgattorney