OPEN_SOURCE
REDDIT · 7h ago · BENCHMARK RESULT
vLLM Lags Local Runtimes on Blackwell
A Reddit user reports that on RTX Pro 6000 Blackwell GPUs, NVIDIA’s vLLM containers with NVFP4, INT4, and FP8 are still lagging behind LM Studio and Ollama on tokens per second, while also taking much longer to load models. The post questions whether Blackwell’s native 4-bit formats should deliver a larger performance jump, and notes that vLLM’s multi-token prediction is the main feature currently helping it keep up.
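For context on why a bigger jump was expected from the native 4-bit formats, here is a back-of-envelope weight-memory comparison (the 70B parameter count is our illustrative assumption, not from the post, and per-block scale metadata and KV cache are ignored). Decode is largely memory-bandwidth-bound, so halving bytes per weight should, in principle, raise tokens per second, which is why flat results across NVFP4/INT4/FP8 point at a software bottleneck rather than the formats themselves:

```python
# Rough weight-storage footprint at different precisions for a hypothetical
# 70B-parameter model (illustrative size; ignores quantization scale metadata).
def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Raw weight bytes converted to GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 70e9  # assumption for illustration only
for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4/INT4", 4)]:
    print(f"{name:>10}: ~{weight_gib(N_PARAMS, bits):.0f} GiB")
# FP16 -> ~130 GiB, FP8 -> ~65 GiB, 4-bit -> ~33 GiB
```

If the runtime's 4-bit kernels don't actually exploit that bandwidth saving, the memory win shows up without a matching throughput win.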
// ANALYSIS
Hot take: this looks less like a broken setup and more like a reminder that Blackwell support, quantization format, and serving-stack maturity are separate problems.
- NVIDIA’s vLLM container docs now explicitly call out RTX PRO 6000 Blackwell support and NVFP4 on Blackwell, but they also say the current 25.09 container is the first one with NVIDIA GPU optimizations, so the stack is still early.
- vLLM docs list NVFP4 and MXFP4 as Blackwell-native compression schemes, but that only tells you the hardware path exists; it does not guarantee a large throughput advantage over another runtime.
- LM Studio publicly positions itself as an offline local model runner with an OpenAI-compatible local server, and its product page says it uses llama.cpp among its inference engines, which makes it a strong baseline for single-model local serving.
- The huge load-time gap the user reports is plausibly about runtime overhead, model conversion, or kernel coverage in vLLM rather than precision alone; that is an inference from the docs plus the benchmark numbers in the post.
- vLLM’s advantage here is likely in serving features such as multi-token prediction and batching/orchestration, but the post suggests those features are not enough to erase the latency gap in this specific setup.
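One way to make such comparisons apples-to-apples: vLLM, LM Studio, and Ollama all expose OpenAI-compatible streaming endpoints, so the same harness can time all three. A minimal sketch of the measurement itself (`measure_throughput` and the `fake_stream` stand-in are our names, not from the post or any of these tools). Starting the clock at the first token separates load/prefill cost, where the user saw the biggest gap, from steady-state decode speed:

```python
import time

def measure_throughput(token_stream) -> float:
    """Tokens/sec measured from the first streamed token to the last.

    Timing from the first token excludes model load and prefill, so a slow
    startup (the load-time gap in the post) doesn't skew decode speed."""
    first = last = None
    n = 0
    for _tok in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now
        last = now
        n += 1
    if n < 2 or last == first:
        return 0.0  # can't compute a rate from fewer than two timestamps
    return (n - 1) / (last - first)

def fake_stream(n_tokens: int, delay_s: float):
    # Stand-in for an SSE chunk stream from any OpenAI-compatible server.
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"

print(f"{measure_throughput(fake_stream(50, 0.002)):.0f} tok/s")
```

In a real run, `fake_stream` would be replaced by the streamed chunks of a chat-completions response from each server, with identical prompts and generation settings.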
// TAGS
vllm · blackwell · nvfp4 · mxfp4 · fp8 · int4 · llama.cpp · lm-studio · ollama · rtx-pro-6000
DISCOVERED
7h ago
2026-04-18
PUBLISHED
8h ago
2026-04-18
RELEVANCE
9/10
AUTHOR
aaronr_90