OPEN_SOURCE
REDDIT · 5h ago · OPEN-SOURCE RELEASE
llama.cpp merges native Blackwell NVFP4 support
llama.cpp has merged preliminary SM120 native NVFP4 MMQ support, bringing hardware-native FP4 inference to Blackwell-class GPUs. The post also notes that GGUF builds are already appearing for models like Gemma 4, Nemotron Cascade 2, and Qwen3.5 in NVFP4 form.
// ANALYSIS
This is a meaningful infrastructure step, not just another quantization tweak: llama.cpp is moving from "can load the format" toward actually exploiting Blackwell silicon the way NVIDIA intended. It’s still preliminary, but it should matter immediately to anyone chasing better throughput-per-watt on local or semi-local rigs.
- The merge targets SM120 Blackwell GPUs, so the win is tied to newer NVIDIA hardware rather than being a broad, across-the-board speedup
- Native NVFP4 support narrows the gap between model packaging and kernel support, which is why GGUF variants are already surfacing so quickly
- For local inference users, this strengthens llama.cpp's position as the first stop for bleeding-edge quant formats and vendor-specific hardware features
- The "preliminary" label matters: expect rough edges, model-by-model quirks, and a period of rapid follow-up fixes
- This is especially relevant for MoE and larger models, where memory bandwidth and quantized math are the bottlenecks
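To make the format concrete, here is a minimal sketch of NVFP4-style block quantization. The value grid and block size come from the public NVFP4 spec (E2M1 magnitudes, 16-element micro-blocks); the helper names are hypothetical, and the per-block scale is kept as a plain Python float for clarity, whereas real NVFP4 stores it in FP8 (E4M3) alongside a tensor-level scale.

```python
# E2M1 (FP4): 1 sign bit, 2 exponent bits, 1 mantissa bit -> 8 magnitudes.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats to the FP4 grid with a shared scale.
    Hypothetical helper, not llama.cpp's actual kernel code."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = amax / E2M1[-1]  # map the largest magnitude onto 6.0
    q = []
    for v in block:
        mag = abs(v) / scale
        nearest = min(E2M1, key=lambda g: abs(g - mag))  # snap to the grid
        q.append(nearest if v >= 0 else -nearest)
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

# One 16-element block, since NVFP4 groups values in micro-blocks of 16:
x = [0.1, -0.8, 1.2, 2.5] * 4
q, s = quantize_block(x)
x_hat = dequantize_block(q, s)
```

The coarse 8-value grid is why the per-block scale matters so much: each 16-element block gets its own dynamic range, which is also what the hardware-native MMQ kernels exploit on Blackwell instead of dequantizing to a wider type first.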
// TAGS
llama-cpp · gpu · inference · open-source · llm · self-hosted
DISCOVERED
5h ago
2026-04-29
PUBLISHED
8h ago
2026-04-29
RELEVANCE
9/10
AUTHOR
ggonavyy