Blackwell devs point to vLLM for NVFP4 inference
OPEN_SOURCE
REDDIT · 26d ago · INFRASTRUCTURE

A LocalLLaMA user asked for an open-source framework with NVFP4 support on NVIDIA Blackwell, assuming llama.cpp only handled MXFP4. Community replies pointed to vLLM, and current vLLM release notes and docs back that recommendation, while llama.cpp’s NVFP4 support has only recently landed on `master`.

// ANALYSIS

If you need NVFP4 right now, vLLM looks like the most practical open-source route, but support is still version-sensitive and evolving quickly.

  • vLLM `v0.12.0` release notes include “NVFP4 MoE CUTLASS support for SM120” (Blackwell-class RTX cards): https://github.com/vllm-project/vllm/releases
  • vLLM docs explicitly list ModelOpt `NVFP4` checkpoints (`quantization="modelopt_fp4"`): https://docs.vllm.ai/en/latest/features/quantization/modelopt/
  • TensorRT-LLM documents NVFP4 for Blackwell plus a precision support matrix, which sets a strong reference point for production readiness: https://nvidia.github.io/TensorRT-LLM/reference/precision.html
  • llama.cpp merged “add NVFP4 quantization type support” on March 11, 2026, with GPU backend pieces discussed as follow-up work: https://github.com/ggml-org/llama.cpp/pull/19769
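
Based on the vLLM docs cited above, loading a ModelOpt NVFP4 checkpoint comes down to one constructor argument. A minimal sketch, assuming a recent vLLM build on Blackwell-class (SM120) hardware; the model ID is a hypothetical placeholder, not a checkpoint named in the thread:

```python
# Sketch: offline inference with a ModelOpt NVFP4 checkpoint in vLLM.
# Requires vLLM >= v0.12.0 and a Blackwell GPU; the model ID below is a
# hypothetical placeholder -- substitute a real NVFP4 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP4",  # hypothetical NVFP4 checkpoint
    quantization="modelopt_fp4",               # NVFP4 path per the vLLM ModelOpt docs
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is NVFP4?"], params)
print(outputs[0].outputs[0].text)
```

The equivalent server invocation would be `vllm serve <model> --quantization modelopt_fp4`; either way, support is hardware- and version-sensitive, so pin the vLLM version you validated.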
// TAGS
vllm · blackwell · nvfp4 · inference · gpu · open-source · llama-cpp · tensorrt-llm

DISCOVERED

2026-03-17

PUBLISHED

2026-03-17

RELEVANCE

8/10

AUTHOR

ResponsibleTruck4717