llama.cpp lands native NVFP4 on Blackwell
OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT

llama.cpp b8967 adds native NVFP4 support for Blackwell GPUs, backed by a fresh CUDA benchmark run on an RTX 5090-class system. The posted results show very high prefill throughput and roughly 70 tok/s decode on a Qwen3.6 27B NVFP4 model.
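
For context on what the format encodes: NVFP4 stores weights as FP4 E2M1 values (1 sign bit, 2 exponent bits, 1 mantissa bit) in 16-element micro-blocks, each carrying an FP8 E4M3 scale, with a per-tensor FP32 scale on top. The numpy sketch below simulates that block quantization so the numbers in the post have a concrete anchor; it keeps the block scale as a plain float instead of E4M3, and it illustrates the format only, not llama.cpp's actual kernels.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

BLOCK = 16  # NVFP4 micro-block size

def quantize_nvfp4_block(x):
    """Quantize one 16-element block to signed E2M1 values plus a scale.

    Real NVFP4 stores this scale as FP8 E4M3 (plus a per-tensor FP32
    scale); it is kept as a plain float here for clarity.
    """
    assert x.shape == (BLOCK,)
    amax = float(np.abs(x).max())
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0  # map |max| onto 6.0
    mag = np.abs(x) / scale
    # Snap each magnitude to the nearest grid point (ties resolve toward
    # the lower value via argmin; real implementations typically round
    # to nearest even).
    idx = np.abs(mag[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(x) * E2M1_GRID[idx]
    return q, scale

def dequantize_block(q, scale):
    return q * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(BLOCK).astype(np.float32)
    q, s = quantize_nvfp4_block(x)
    print("max abs error:", float(np.abs(dequantize_block(q, s) - x).max()))
```

The per-block scale is also why NVFP4 costs about 4.5 bits per weight rather than a flat 4: one 8-bit scale amortized over 16 elements.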

// ANALYSIS

This is a meaningful infrastructure win for local AI on Nvidia’s newest hardware: the format support is in place, and the bench numbers suggest it is already useful for real workloads, not just a compatibility checkbox.

  • Native NVFP4 matters because Blackwell’s 4-bit path is a core part of the hardware design, not an afterthought; llama.cpp is now tracking it closely (the sketch above shows what the format encodes).
  • The benchmark profile is heavily prefill-friendly: 5.5K+ tok/s at short contexts, then a gradual drop as depth increases, which is what you’d expect when memory bandwidth and growing KV-cache reads, rather than raw compute, set the limit.
  • Decode around 70 tok/s on a 27B model is strong for a local setup, especially with the whole model on a single GPU and no CPU offload in the test; a rough roofline check follows after this list.
  • This is a good signal for Blackwell owners, but it is still one data point from one model and one build; different architectures, contexts, and batch shapes can change the picture.
  • The release note framing suggests this is the first real integration step, so expect follow-up fixes and tuning as more NVFP4 models and kernels land.
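
As a sanity check on that decode figure, here is a minimal bandwidth-roofline sketch. The ~1.8 TB/s memory bandwidth for an RTX 5090-class card and the ~4.5 effective bits per weight are assumptions on my part, not numbers from the post:

```python
# Decode-speed roofline under assumed, not posted, numbers:
# ~1.8 TB/s memory bandwidth (RTX 5090-class) and ~4.5 effective bits
# per weight (4-bit E2M1 payload + one 8-bit scale per 16 elements).
params = 27e9                  # 27B model, per the post
bits_per_weight = 4 + 8 / 16   # payload + amortized block scale
weight_bytes = params * bits_per_weight / 8
bandwidth = 1.8e12             # bytes/s, assumed

# Single-stream decode streams every weight once per token, so weight
# traffic alone caps tokens/s; KV-cache reads push the real limit lower.
ceiling = bandwidth / weight_bytes
print(f"weights on card: {weight_bytes / 1e9:.1f} GB")
print(f"decode ceiling:  {ceiling:.0f} tok/s")
print(f"observed 70 tok/s = {70 / ceiling:.0%} of that ceiling")
```

On those assumptions the ceiling lands near 120 tok/s, so the posted ~70 tok/s is roughly 60% of the pure weight-streaming bound, a credible place for a first-pass kernel to sit, with headroom left for tuning.
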
// TAGS
llama-cpp · gpu · inference · benchmark · open-source

DISCOVERED: 3h ago (2026-04-29)
PUBLISHED: 6h ago (2026-04-29)
RELEVANCE: 9/10
AUTHOR: mossy_troll_84