Blackwell devs point to vLLM for NVFP4 inference
OPEN_SOURCE
REDDIT · 26d ago · INFRASTRUCTURE

A LocalLLaMA user asked for an open-source framework with NVFP4 support on NVIDIA Blackwell, assuming llama.cpp only handled MXFP4. Community replies pointed to vLLM, and current vLLM release notes and docs back that recommendation, while llama.cpp’s NVFP4 support has only recently landed on `master`.

// ANALYSIS

If you need NVFP4 right now, vLLM looks like the most practical open-source route, but support is still version-sensitive and evolving quickly.

  • vLLM `v0.12.0` release notes include “NVFP4 MoE CUTLASS support for SM120” (Blackwell-class RTX cards): https://github.com/vllm-project/vllm/releases
  • vLLM docs explicitly list ModelOpt `NVFP4` checkpoints (`quantization="modelopt_fp4"`): https://docs.vllm.ai/en/latest/features/quantization/modelopt/
  • TensorRT-LLM documents NVFP4 for Blackwell plus a precision support matrix, which sets a strong reference point for production readiness: https://nvidia.github.io/TensorRT-LLM/reference/precision.html
  • llama.cpp merged “add NVFP4 quantization type support” on March 11, 2026, with GPU backend pieces discussed as follow-up work: https://github.com/ggml-org/llama.cpp/pull/19769
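
Based on the vLLM docs cited above, loading a ModelOpt NVFP4 checkpoint comes down to one constructor argument. A minimal sketch, assuming a recent vLLM build on Blackwell-class (SM120) hardware; the model ID is a hypothetical placeholder, not a checkpoint named in the thread:

```python
# Sketch: offline inference with a ModelOpt NVFP4 checkpoint in vLLM.
# Requires vLLM >= v0.12.0 and a Blackwell GPU; the model ID below is a
# hypothetical placeholder -- substitute a real NVFP4 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP4",  # hypothetical NVFP4 checkpoint
    quantization="modelopt_fp4",               # NVFP4 path per the vLLM ModelOpt docs
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is NVFP4?"], params)
print(outputs[0].outputs[0].text)
```

The equivalent server invocation would be `vllm serve <model> --quantization modelopt_fp4`; either way, support is hardware- and version-sensitive, so pin the vLLM version you validated.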
// TAGS
vllm · blackwell · nvfp4 · inference · gpu · open-source · llama-cpp · tensorrt-llm

DISCOVERED

2026-03-17

PUBLISHED

2026-03-17

RELEVANCE

8/10

AUTHOR

ResponsibleTruck4717