Blackwell devs point to vLLM for NVFP4 inference

// 71d ago · INFRASTRUCTURE

A LocalLLaMA user asked for an open-source framework with NVFP4 support on NVIDIA Blackwell, assuming that llama.cpp handled only MXFP4. Community replies pointed to vLLM, and current vLLM release notes and docs back that recommendation, while llama.cpp’s NVFP4 support landed on `master` only recently.

// ANALYSIS

If you need NVFP4 right now, vLLM looks like the most practical open-source route, but support is still version-sensitive and evolving quickly.

  • vLLM `v0.12.0` release notes include “NVFP4 MoE CUTLASS support for SM120” (Blackwell-class RTX cards): https://github.com/vllm-project/vllm/releases
  • vLLM docs explicitly list ModelOpt `NVFP4` checkpoints (`quantization="modelopt_fp4"`): https://docs.vllm.ai/en/latest/features/quantization/modelopt/
  • TensorRT-LLM documents NVFP4 for Blackwell plus a precision support matrix, which sets a strong reference point for production readiness: https://nvidia.github.io/TensorRT-LLM/reference/precision.html
  • llama.cpp merged “add NVFP4 quantization type support” on March 11, 2026, with GPU backend pieces discussed as follow-up work: https://github.com/ggml-org/llama.cpp/pull/19769
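
As a rough illustration of the vLLM route described above, the sketch below gates on the installed vLLM version and then loads a ModelOpt NVFP4 checkpoint with `quantization="modelopt_fp4"`. The `0.12.0` floor is an assumption taken from the release note cited above, not an official requirement, and the model ID is a placeholder rather than a real checkpoint. The helper names (`parse_version`, `meets_minimum`, `nvfp4_engine_kwargs`, `generate_once`) are hypothetical.

```python
import re
from importlib import metadata

# Assumed floor, based only on the v0.12.0 release note cited above.
MIN_VLLM = (0, 12, 0)

# Placeholder: substitute a real ModelOpt NVFP4 checkpoint ID or local path.
MODEL_ID = "your-org/your-model-NVFP4"


def parse_version(text: str) -> tuple:
    """Keep the leading numeric components, e.g. 'v0.12.0rc1' -> (0, 12, 0)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = tuple(int(p) for p in match.group(1).split("."))
    # Pad to three components so (0, 12) compares equal to (0, 12, 0).
    return parts + (0,) * max(0, 3 - len(parts))


def meets_minimum(installed: str, minimum: tuple = MIN_VLLM) -> bool:
    """True if the installed version is at least the assumed minimum."""
    return parse_version(installed) >= minimum


def nvfp4_engine_kwargs(model_id: str, max_model_len: int = 4096) -> dict:
    """Build keyword arguments for vllm.LLM with ModelOpt NVFP4 quantization."""
    return {
        "model": model_id,
        "quantization": "modelopt_fp4",
        "max_model_len": max_model_len,
    }


def generate_once(prompt: str, model_id: str = MODEL_ID) -> str:
    """Load the checkpoint and return one greedy completion (needs vLLM and a GPU)."""
    installed = metadata.version("vllm")
    if not meets_minimum(installed):
        raise RuntimeError(f"vLLM {installed} is older than the assumed minimum {MIN_VLLM}")

    # Imported lazily so the helpers above work without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**nvfp4_engine_kwargs(model_id))
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Calling `generate_once("Hello")` only makes sense on a machine with a supported GPU and a real NVFP4 checkpoint; the version and kwargs helpers can be exercised anywhere.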

// TAGS

vllm, blackwell, nvfp4, inference, gpu, open-source, llama-cpp, tensorrt-llm

DISCOVERED: 71d ago (2026-03-17)

PUBLISHED: 71d ago (2026-03-17)

RELEVANCE: 8/10

AUTHOR: ResponsibleTruck4717