OPEN_SOURCE
REDDIT · 26d ago · INFRASTRUCTURE
Blackwell devs point to vLLM for NVFP4 inference
A LocalLLaMA user asked for an open-source framework with NVFP4 support on NVIDIA Blackwell, assuming llama.cpp only handled MXFP4. Community replies pointed to vLLM, and current vLLM release notes and docs back that recommendation, while llama.cpp's NVFP4 support has only recently landed on `master`.
// ANALYSIS
If you need NVFP4 right now, vLLM looks like the most practical open-source route, but support is still version-sensitive and evolving quickly.
- vLLM `v0.12.0` release notes include "NVFP4 MoE CUTLASS support for SM120" (Blackwell-class RTX cards): https://github.com/vllm-project/vllm/releases
- vLLM docs explicitly list ModelOpt `NVFP4` checkpoints (`quantization="modelopt_fp4"`): https://docs.vllm.ai/en/latest/features/quantization/modelopt/
- TensorRT-LLM documents NVFP4 for Blackwell plus a precision support matrix, which sets a strong reference point for production readiness: https://nvidia.github.io/TensorRT-LLM/reference/precision.html
- llama.cpp merged "add NVFP4 quantization type support" on March 11, 2026, with GPU backend pieces discussed as follow-up work: https://github.com/ggml-org/llama.cpp/pull/19769
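The vLLM route above can be sketched as follows. This is a minimal sketch, assuming vLLM `v0.12.0`+ on a Blackwell (SM120) GPU; the checkpoint name is illustrative, not a confirmed model id, and `quantization="modelopt_fp4"` is the ModelOpt NVFP4 path named in the vLLM docs linked above.

```python
def make_nvfp4_engine_kwargs(model_id: str) -> dict:
    """Engine kwargs for loading a ModelOpt NVFP4 checkpoint in vLLM.

    quantization="modelopt_fp4" selects the NVFP4 path listed in
    vLLM's ModelOpt quantization docs.
    """
    return {"model": model_id, "quantization": "modelopt_fp4"}


if __name__ == "__main__":
    # Requires a Blackwell GPU and an NVFP4 checkpoint; model id is hypothetical.
    from vllm import LLM, SamplingParams

    llm = LLM(**make_nvfp4_engine_kwargs("nvidia/Llama-3.1-8B-Instruct-FP4"))
    out = llm.generate(["Explain NVFP4 briefly."], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
```

If the GPU or vLLM build lacks SM120 NVFP4 kernels, engine construction is the point where it should fail, so version pinning matters here.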
// TAGS
vllm · blackwell · nvfp4 · inference · gpu · open-source · llama-cpp · tensorrt-llm
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
RELEVANCE
8/10
AUTHOR
ResponsibleTruck4717