OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
llama.cpp lands native NVFP4 on Blackwell
llama.cpp b8967 adds native NVFP4 support for Blackwell GPUs, backed by a fresh CUDA benchmark run on an RTX 5090-class system. The posted results show very high prefill throughput and roughly 70 tok/s decode on a Qwen3.6 27B NVFP4 model.
// ANALYSIS
This is a meaningful infrastructure win for local AI on Nvidia’s newest hardware: the format support is in place, and the bench numbers suggest it is already useful for real workloads, not just a compatibility checkbox.
- Native NVFP4 matters because Blackwell's 4-bit datapath is part of the hardware design, not an afterthought; llama.cpp is now tracking that capability closely.
- The benchmark profile is heavily prefill-friendly: 5.5K+ tok/s at short contexts, then a gradual drop as context depth increases, which is what you'd expect from memory-bandwidth and attention pressure rather than a pure compute bottleneck.
- Decode around 70 tok/s on a 27B model is strong for a local setup, especially with the whole model on a single GPU and no CPU offload in the test.
- This is a good signal for Blackwell owners, but it is still one data point from one model and one build; different architectures, context lengths, and batch shapes can change the picture.
- The release-note framing suggests this is the first real integration step, so expect follow-up fixes and tuning as more NVFP4 models and kernels land.
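A rough way to sanity-check the decode figure in the bullets above: single-stream decode is typically memory-bandwidth bound, because each generated token re-reads the full weight set. A minimal back-of-envelope sketch, assuming an RTX 5090-class card at roughly 1.79 TB/s and a 27B model at ~0.5 bytes per parameter in NVFP4 (both are illustrative assumptions, and KV-cache and activation traffic are ignored):

```python
# Back-of-envelope decode ceiling: tokens/s is capped near
# (memory bandwidth) / (bytes read per token).
# All constants below are assumptions for illustration, not measurements.

PARAMS = 27e9              # model size claimed in the post (27B)
BYTES_PER_PARAM = 0.5      # NVFP4 ~ 4 bits/weight, ignoring scale overhead
BANDWIDTH = 1.79e12        # ~RTX 5090-class memory bandwidth, bytes/s

weight_bytes = PARAMS * BYTES_PER_PARAM    # ~13.5 GB streamed per token
ceiling_toks = BANDWIDTH / weight_bytes    # idealized upper bound, tok/s

print(f"theoretical decode ceiling: {ceiling_toks:.0f} tok/s")
print(f"observed 70 tok/s is {70 / ceiling_toks:.0%} of that ceiling")
```

On those assumptions, the reported ~70 tok/s sits at roughly half the bandwidth ceiling, which is a plausible place to land once quantization-scale reads, KV-cache traffic, and kernel overheads are accounted for.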
// TAGS
llama-cpp · gpu · inference · benchmark · open-source
DISCOVERED
3h ago
2026-04-29
PUBLISHED
6h ago
2026-04-29
RELEVANCE
9/10
AUTHOR
mossy_troll_84