OPEN_SOURCE
REDDIT // 5h ago · OPEN_SOURCE RELEASE
llama.cpp, ik_llama.cpp land native FP4 support
Llama.cpp and the ik_llama.cpp fork have integrated native FP4 support, introducing the GGML_TYPE_NVFP4 and GGML_TYPE_MXFP4 formats. This milestone enables higher-fidelity 4-bit inference with significant VRAM savings and hardware-level acceleration on NVIDIA's Blackwell GPUs, narrowing the gap between quantization efficiency and model quality.
// ANALYSIS
Native FP4 is a paradigm shift that brings hardware-level efficiency to local LLM quantization without the quality trade-offs of integer methods.
- NVFP4 enables native Blackwell Tensor Core acceleration for up to 2.3x inference speedups.
- MXFP4 provides a cross-platform 4-bit float standard for CPU and non-Blackwell GPU backends.
- Direct collaboration with NVIDIA ensures seamless support for next-gen models like gpt-oss.
- The implementation significantly reduces the VRAM barrier to running large models on consumer-grade hardware.
- Evolution of the GGUF format to support block-scaled FP4 weights preserves model fidelity during quantization.
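To make "block-scaled FP4" concrete, here is a minimal sketch of MXFP4-style quantization as described by the OCP Microscaling format: blocks of values share one power-of-two scale, and each element is stored as a 4-bit E2M1 float (1 sign bit, 2 exponent bits, 1 mantissa bit). This is an illustrative assumption-laden sketch, not llama.cpp's actual kernel code; the scale-selection rule in particular varies between implementations.

```python
# Hedged sketch of MXFP4-style block quantization (OCP Microscaling style).
# NOT llama.cpp's implementation: block size, rounding, and scale selection
# here are simplified for illustration.
import math

# The eight magnitudes representable by an E2M1 4-bit float.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a small block of floats to (shared exponent, 4-bit codes)."""
    amax = max(abs(x) for x in block)
    # Shared power-of-two scale so the largest value lands near E2M1's max (6.0).
    shared_exp = math.floor(math.log2(amax / 6.0)) if amax > 0 else 0
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        # Round to the nearest representable E2M1 magnitude.
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 0x8 if x < 0 else 0x0
        codes.append(sign | idx)  # 4-bit code: 1 sign bit + 3-bit magnitude index
    return shared_exp, codes

def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from the shared exponent and 4-bit codes."""
    scale = 2.0 ** shared_exp
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_VALUES[c & 0x7] * scale
            for c in codes]
```

Because the shared scale is a power of two, values that are exact E2M1 multiples of it round-trip losslessly, which is why block-scaled FP4 can track the original weight distribution more closely than a single per-tensor integer scale.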
// TAGS
llama-cpp · llm · inference · gpu · open-source
DISCOVERED
5h ago
2026-04-25
PUBLISHED
6h ago
2026-04-25
RELEVANCE
9/10
AUTHOR
Usual-Carrot6352