OPEN_SOURCE
REDDIT · 7h ago · BENCHMARK RESULT
vLLM Lags Local Runtimes on Blackwell
A Reddit user reports that on RTX Pro 6000 Blackwell GPUs, NVIDIA’s vLLM containers with NVFP4, INT4, and FP8 are still lagging behind LM Studio and Ollama on tokens per second, while also taking much longer to load models. The post questions whether Blackwell’s native 4-bit formats should deliver a larger performance jump, and notes that vLLM’s multi-token prediction is the main feature currently helping it keep up.
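For context on why a bigger jump was expected from the native 4-bit formats, here is a back-of-envelope weight-memory comparison (the 70B parameter count is our illustrative assumption, not from the post, and per-block scale metadata and KV cache are ignored). Decode is largely memory-bandwidth-bound, so halving bytes per weight should, in principle, raise tokens per second, which is why flat results across NVFP4/INT4/FP8 point at a software bottleneck rather than the formats themselves:

```python
# Rough weight-storage footprint at different precisions for a hypothetical
# 70B-parameter model (illustrative size; ignores quantization scale metadata).
def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Raw weight bytes converted to GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 70e9  # assumption for illustration only
for name, bits in [("FP16", 16), ("FP8", 8), ("NVFP4/INT4", 4)]:
    print(f"{name:>10}: ~{weight_gib(N_PARAMS, bits):.0f} GiB")
# FP16 -> ~130 GiB, FP8 -> ~65 GiB, 4-bit -> ~33 GiB
```

If the runtime's 4-bit kernels don't actually exploit that bandwidth saving, the memory win shows up without a matching throughput win.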
// ANALYSIS
Hot take: this looks less like a broken setup and more like a reminder that Blackwell support, quantization format, and serving-stack maturity are separate problems.
- NVIDIA’s vLLM container docs now explicitly call out RTX PRO 6000 Blackwell support and NVFP4 on Blackwell, but they also say the current 25.09 container is the first one with NVIDIA GPU optimizations, so the stack is still early.
- vLLM docs list NVFP4 and MXFP4 as Blackwell-native compression schemes, but that only tells you the hardware path exists; it does not guarantee a large throughput advantage over another runtime.
- LM Studio publicly positions itself as an offline local model runner with an OpenAI-compatible local server, and its product page says it uses llama.cpp among its inference engines, which makes it a strong baseline for single-model local serving.
- The huge load-time gap the user reports is plausibly about runtime overhead, model conversion, or kernel coverage in vLLM rather than precision alone; that is an inference from the docs plus the benchmark numbers in the post.
- vLLM’s advantage here is likely in serving features such as multi-token prediction and batching/orchestration, but the post suggests those features are not enough to erase the latency gap in this specific setup.
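One way to make such comparisons apples-to-apples: vLLM, LM Studio, and Ollama all expose OpenAI-compatible streaming endpoints, so the same harness can time all three. A minimal sketch of the measurement itself (`measure_throughput` and the `fake_stream` stand-in are our names, not from the post or any of these tools). Starting the clock at the first token separates load/prefill cost, where the user saw the biggest gap, from steady-state decode speed:

```python
import time

def measure_throughput(token_stream) -> float:
    """Tokens/sec measured from the first streamed token to the last.

    Timing from the first token excludes model load and prefill, so a slow
    startup (the load-time gap in the post) doesn't skew decode speed."""
    first = last = None
    n = 0
    for _tok in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now
        last = now
        n += 1
    if n < 2 or last == first:
        return 0.0  # can't compute a rate from fewer than two timestamps
    return (n - 1) / (last - first)

def fake_stream(n_tokens: int, delay_s: float):
    # Stand-in for an SSE chunk stream from any OpenAI-compatible server.
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"

print(f"{measure_throughput(fake_stream(50, 0.002)):.0f} tok/s")
```

In a real run, `fake_stream` would be replaced by the streamed chunks of a chat-completions response from each server, with identical prompts and generation settings.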
// TAGS
vllm · blackwell · nvfp4 · mxfp4 · fp8 · int4 · llama.cpp · lm-studio · ollama · rtx-pro-6000
DISCOVERED
7h ago
2026-04-18
PUBLISHED
8h ago
2026-04-18
RELEVANCE
9/10
AUTHOR
aaronr_90