OPEN_SOURCE
REDDIT // 27d ago // TUTORIAL
vLLM hits 630 tok/s on RTX 5090 with Nemotron
A practitioner's benchmark of Nemotron Nano 9B v2 Japanese on an RTX 5090 with vLLM 0.15.1, documenting 630 tok/s batched throughput at BF16 and three non-obvious configuration bugs that silently break reasoning model output.
// ANALYSIS
Consumer Blackwell running production inference is now real — the RTX 5090's 32 GB VRAM changes the calculus for local model deployment, and this post is the sharpest practical field guide yet for getting Nemotron's Mamba-hybrid architecture working correctly on vLLM.
- The HuggingFace reasoning parser plugin ships with broken import paths on vLLM 0.15.1 — a silent crash on startup that would stump anyone who didn't know to look
- Setting max_tokens below 1024 with reasoning enabled returns `content: null` with no error; thinking tokens silently eat the entire token budget before any output is generated
- The `--mamba_ssm_cache_dtype float32` flag is mandatory for accuracy — omitting it doesn't crash, it just quietly degrades model outputs
- TRT-LLM was benched and rejected: Nemotron's Mamba-hybrid architecture has limited TRT-LLM Mamba2 support, and without FP8 the speed gains shrink to 10–30% over vLLM — not worth the operational complexity
- 630 tok/s batch throughput on a consumer GPU with BF16 and no quantization is a meaningful data point for anyone sizing inference hardware
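The configuration points above can be combined into a single launch command. A minimal sketch, assuming the standard `vllm serve` entry point: the model id and the context-length value are placeholders (the post does not specify them), while `--mamba_ssm_cache_dtype float32` and BF16 are quoted from the post.

```shell
# Hedged sketch of a vLLM launch for Nemotron Nano 9B v2 on an RTX 5090.
# <nemotron-nano-9b-v2-model-id> is a placeholder; substitute your local
# HuggingFace model id. --max-model-len is an assumed example value.
vllm serve <nemotron-nano-9b-v2-model-id> \
    --dtype bfloat16 \
    --mamba_ssm_cache_dtype float32 \
    --max-model-len 32768
```

Per the post, leave `--mamba_ssm_cache_dtype float32` in place even though omitting it appears to work, and when reasoning is enabled request well over 1024 output tokens — smaller `max_tokens` budgets are consumed entirely by thinking tokens and the API returns `content: null` with no error.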
// TAGS
vllm · inference · gpu · llm · reasoning · benchmark · open-source
DISCOVERED
2026-03-16 (27d ago)
PUBLISHED
2026-03-15 (27d ago)
RELEVANCE
7/10
AUTHOR
Impressive_Tower_550