OPEN_SOURCE
REDDIT // 5h ago · OPEN_SOURCE RELEASE
llama.cpp, ik_llama.cpp land native FP4 support
Llama.cpp and the ik_llama.cpp fork have integrated native FP4 support, introducing the GGML_TYPE_NVFP4 and GGML_TYPE_MXFP4 formats. This milestone enables higher-fidelity 4-bit inference with significant VRAM savings and hardware-level acceleration on NVIDIA's Blackwell GPUs, narrowing the gap between quantization efficiency and model quality.
// ANALYSIS
Native FP4 is a paradigm shift that brings hardware-level efficiency to local LLM quantization without the quality trade-offs of integer methods.
- NVFP4 enables native Blackwell Tensor Core acceleration for up to 2.3x inference speedups.
- MXFP4 provides a cross-platform 4-bit float standard for CPU and non-Blackwell GPU backends.
- Direct collaboration with NVIDIA ensures seamless support for next-gen models like gpt-oss.
- The implementation significantly reduces the VRAM barrier to running large models on consumer-grade hardware.
- Evolution of the GGUF format to support block-scaled FP4 weights preserves model fidelity during quantization.
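To make "block-scaled FP4" concrete, here is a minimal sketch of MXFP4-style quantization as described by the OCP Microscaling format: blocks of values share one power-of-two scale, and each element is stored as a 4-bit E2M1 float (1 sign bit, 2 exponent bits, 1 mantissa bit). This is an illustrative assumption-laden sketch, not llama.cpp's actual kernel code; the scale-selection rule in particular varies between implementations.

```python
# Hedged sketch of MXFP4-style block quantization (OCP Microscaling style).
# NOT llama.cpp's implementation: block size, rounding, and scale selection
# here are simplified for illustration.
import math

# The eight magnitudes representable by an E2M1 4-bit float.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a small block of floats to (shared exponent, 4-bit codes)."""
    amax = max(abs(x) for x in block)
    # Shared power-of-two scale so the largest value lands near E2M1's max (6.0).
    shared_exp = math.floor(math.log2(amax / 6.0)) if amax > 0 else 0
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        # Round to the nearest representable E2M1 magnitude.
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 0x8 if x < 0 else 0x0
        codes.append(sign | idx)  # 4-bit code: 1 sign bit + 3-bit magnitude index
    return shared_exp, codes

def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from the shared exponent and 4-bit codes."""
    scale = 2.0 ** shared_exp
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_VALUES[c & 0x7] * scale
            for c in codes]
```

Because the shared scale is a power of two, values that are exact E2M1 multiples of it round-trip losslessly, which is why block-scaled FP4 can track the original weight distribution more closely than a single per-tensor integer scale.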
// TAGS
llama-cpp · llm · inference · gpu · open-source
DISCOVERED
5h ago
2026-04-25
PUBLISHED
6h ago
2026-04-25
RELEVANCE
9/10
AUTHOR
Usual-Carrot6352