llama.cpp merges native Blackwell NVFP4 support

REDDIT · OPEN_SOURCE RELEASE

llama.cpp has merged preliminary SM120 native NVFP4 MMQ (quantized matrix multiplication) support, bringing hardware-native FP4 inference to Blackwell-class GPUs. The post also notes that NVFP4 GGUF builds are already appearing for models like Gemma 4, Nemotron Cascade 2, and Qwen3.5.
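For readers new to the format: NVFP4 stores weights as 4-bit E2M1 values that share one scale per 16-element block (FP8 E4M3 on the hardware side, plus a per-tensor scale). The C++ sketch below only illustrates that layout under those assumptions, with made-up names; it is not llama.cpp's code.

// Minimal illustration of the NVFP4 idea: 4-bit E2M1 codes plus a shared
// per-16-element block scale. Names and layout here are illustrative
// assumptions, not llama.cpp's implementation.
#include <cstdint>
#include <cstdio>

// Magnitudes representable by an E2M1 nibble (1 sign, 2 exponent, 1 mantissa bit).
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode one 4-bit code to a float and apply the block's scale factor.
float nvfp4_decode(uint8_t nibble, float block_scale) {
    float mag  = kE2M1[nibble & 0x7];           // low 3 bits select the magnitude
    float sign = (nibble & 0x8) ? -1.0f : 1.0f; // high bit is the sign
    return sign * mag * block_scale;
}

int main() {
    // One scale per 16-element block; stored as FP8 E4M3 on the hardware side,
    // promoted to float here for clarity.
    float scale  = 0.125f;
    uint8_t byte = 0xA3;    // two packed codes: 0x3 (+1.5) and 0xA (-2.0)
    printf("%f %f\n", nvfp4_decode(byte & 0xF, scale), nvfp4_decode(byte >> 4, scale));
    return 0;
}

The per-block scale is what lets such a coarse 4-bit grid track real weight distributions, and Blackwell's tensor cores can consume the codes and scales directly rather than forcing a dequantize-to-FP16 step first.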

// ANALYSIS

This is a meaningful infrastructure step, not just another quantization tweak: llama.cpp is moving from "can load the format" toward actually exploiting Blackwell silicon the way NVIDIA intended. It’s still preliminary, but it should matter immediately to anyone chasing better throughput-per-watt on local or semi-local rigs.

  • The merge targets SM120 Blackwell GPUs, so the win is tied to newer NVIDIA hardware rather than a broad across-the-board speedup
  • Native NVFP4 support narrows the gap between model packaging and kernel support, which is why GGUF variants are surfacing so quickly
  • For local inference users, this strengthens llama.cpp’s position as the first stop for bleeding-edge quant formats and vendor-specific hardware features
  • The "preliminary" label matters: expect rough edges, model-by-model quirks, and a period of rapid follow-up fixes
  • This is especially relevant for MoE and larger models, where memory bandwidth and quantized math are the bottlenecks; a rough sketch of the blockwise FP4 arithmetic involved follows after this list
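
To make the last bullet concrete, here is a rough CPU-side sketch of a dot product over NVFP4-style blocks that keeps the inner loop in the quantized domain and applies scales once per block, which is the general idea behind MMQ-style kernels. The block layout, struct, and scalar decode are illustrative assumptions, not llama.cpp's actual code or the new SM120 path.

// Rough CPU reference for a blockwise FP4 dot product: multiply decoded codes
// within each 16-element block, then apply both block scales once per block.
// This mirrors the general MMQ idea (keep the inner loop in the quantized
// domain, fold scales in at block granularity); it is not the SM120 kernel.
#include <cstdint>
#include <vector>

static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static inline float e2m1(uint8_t nibble) {
    float mag = kE2M1[nibble & 0x7];
    return (nibble & 0x8) ? -mag : mag;
}

// Hypothetical block layout: 16 codes packed two per byte, plus one scale.
struct Fp4Block {
    uint8_t codes[8];
    float   scale;   // per-block scale (FP8 E4M3 on real hardware)
};

float dot_fp4(const std::vector<Fp4Block>& a, const std::vector<Fp4Block>& b) {
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        float block_sum = 0.0f;
        for (int j = 0; j < 8; ++j) {
            block_sum += e2m1(a[i].codes[j] & 0xF) * e2m1(b[i].codes[j] & 0xF);
            block_sum += e2m1(a[i].codes[j] >> 4)  * e2m1(b[i].codes[j] >> 4);
        }
        // Scales are applied once per 16-element block rather than per element.
        sum += block_sum * a[i].scale * b[i].scale;
    }
    return sum;
}

On the actual hardware the per-block inner products map onto FP4 tensor-core instructions, so the payoff is both fewer bytes moved per weight and denser math per cycle.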
// TAGS
llama-cpp · gpu · inference · open-source · llm · self-hosted

DISCOVERED

2026-04-29 (5h ago)

PUBLISHED

2026-04-29 (8h ago)

RELEVANCE

9/10

AUTHOR

ggonavyy