OPEN_SOURCE
REDDIT · 5h ago · OPEN-SOURCE RELEASE
llama.cpp merges native Blackwell NVFP4 support
llama.cpp has merged preliminary SM120 native NVFP4 MMQ support, bringing hardware-native FP4 inference to Blackwell-class GPUs. The post also notes that GGUF builds are already appearing for models like Gemma 4, Nemotron Cascade 2, and Qwen3.5 in NVFP4 form.
// ANALYSIS
This is a meaningful infrastructure step, not just another quantization tweak: llama.cpp is moving from "can load the format" toward actually exploiting Blackwell silicon the way NVIDIA intended. It’s still preliminary, but it should matter immediately to anyone chasing better throughput-per-watt on local or semi-local rigs.
- The merge targets SM120 Blackwell GPUs, so the win is tied to newer NVIDIA hardware rather than being a broad, across-the-board speedup
- Native NVFP4 support narrows the gap between model packaging and kernel support, which is why GGUF variants are already surfacing so quickly
- For local inference users, this strengthens llama.cpp's position as the first stop for bleeding-edge quant formats and vendor-specific hardware features
- The "preliminary" label matters: expect rough edges, model-by-model quirks, and a period of rapid follow-up fixes
- This is especially relevant for MoE and larger models, where memory bandwidth and quantized math are the bottlenecks
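To make the format concrete, here is a minimal sketch of NVFP4-style block quantization. The value grid and block size come from the public NVFP4 spec (E2M1 magnitudes, 16-element micro-blocks); the helper names are hypothetical, and the per-block scale is kept as a plain Python float for clarity, whereas real NVFP4 stores it in FP8 (E4M3) alongside a tensor-level scale.

```python
# E2M1 (FP4): 1 sign bit, 2 exponent bits, 1 mantissa bit -> 8 magnitudes.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats to the FP4 grid with a shared scale.
    Hypothetical helper, not llama.cpp's actual kernel code."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = amax / E2M1[-1]  # map the largest magnitude onto 6.0
    q = []
    for v in block:
        mag = abs(v) / scale
        nearest = min(E2M1, key=lambda g: abs(g - mag))  # snap to the grid
        q.append(nearest if v >= 0 else -nearest)
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

# One 16-element block, since NVFP4 groups values in micro-blocks of 16:
x = [0.1, -0.8, 1.2, 2.5] * 4
q, s = quantize_block(x)
x_hat = dequantize_block(q, s)
```

The coarse 8-value grid is why the per-block scale matters so much: each 16-element block gets its own dynamic range, which is also what the hardware-native MMQ kernels exploit on Blackwell instead of dequantizing to a wider type first.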
// TAGS
llama-cpp · gpu · inference · open-source · llm · self-hosted
DISCOVERED
5h ago
2026-04-29
PUBLISHED
8h ago
2026-04-29
RELEVANCE
9/10
AUTHOR
ggonavyy