vLLM hits 630 tok/s on RTX 5090 with Nemotron
OPEN_SOURCE · REDDIT · TUTORIAL · 27d ago


A practitioner's benchmark of Nemotron Nano 9B v2 Japanese on an RTX 5090 with vLLM 0.15.1, documenting 630 tok/s batched throughput at BF16 and three non-obvious configuration bugs that silently break reasoning model output.

// ANALYSIS

Consumer Blackwell running production inference is now real. The RTX 5090's 32 GB of VRAM changes the calculus for local model deployment, and this post is the sharpest practical field guide yet for getting Nemotron's Mamba-hybrid architecture working correctly on vLLM.

  • The HuggingFace reasoning parser plugin ships with broken import paths on vLLM 0.15.1, causing a silent crash on startup that would stump anyone who didn't know to look
  • Setting max_tokens below 1024 with reasoning enabled returns `content: null` with no error; thinking tokens silently eat the entire token budget before any output is generated
  • The `--mamba_ssm_cache_dtype float32` flag is mandatory for accuracy; omitting it doesn't crash, it just quietly degrades model outputs
  • TRT-LLM was benched and rejected: Nemotron's Mamba-hybrid architecture has limited TRT-LLM Mamba2 support, and without FP8 the speed gains shrink to 10–30% over vLLM, which is not worth the operational complexity
  • 630 tok/s batch throughput on a consumer GPU with BF16 and no quantization is a meaningful data point for anyone sizing inference hardware
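The max_tokens pitfall above can be guarded against on the client side. A minimal sketch, assuming an OpenAI-compatible chat-completions response shape; the helper names and the 1024-token threshold (taken from the post) are illustrative, not part of any vLLM API:

```python
REASONING_MIN_BUDGET = 1024  # threshold reported in the post for Nemotron with reasoning on

def clamp_max_tokens(requested: int, reasoning: bool = True) -> int:
    """Raise the token budget so thinking tokens can't silently consume all of it."""
    if reasoning and requested < REASONING_MIN_BUDGET:
        return REASONING_MIN_BUDGET
    return requested

def require_content(message: dict) -> str:
    """Fail loudly instead of propagating a silent `content: null` downstream."""
    content = message.get("content")
    if content is None:
        raise RuntimeError(
            "content is null: thinking tokens likely exhausted max_tokens; "
            f"retry with a budget >= {REASONING_MIN_BUDGET}"
        )
    return content
```

Example: `clamp_max_tokens(256)` bumps a too-small budget to 1024, and `require_content` turns the silent-null failure mode into an explicit error at the call site.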
// TAGS
vllm · inference · gpu · llm · reasoning · benchmark · open-source

DISCOVERED

2026-03-16

PUBLISHED

2026-03-15

RELEVANCE

7/10

AUTHOR

Impressive_Tower_550