vLLM hits 630 tok/s on RTX 5090 with Nemotron
OPEN_SOURCE · REDDIT · TUTORIAL · 27d ago


A practitioner's benchmark of Nemotron Nano 9B v2 Japanese on an RTX 5090 with vLLM 0.15.1, documenting 630 tok/s batched throughput at BF16 and three non-obvious configuration bugs that silently break reasoning model output.

// ANALYSIS

Consumer Blackwell running production inference is now real. The RTX 5090's 32 GB of VRAM changes the calculus for local model deployment, and this post is the sharpest practical field guide yet for getting Nemotron's Mamba-hybrid architecture working correctly on vLLM.

  • The HuggingFace reasoning parser plugin ships with broken import paths on vLLM 0.15.1, causing a silent crash on startup that would stump anyone who didn't know to look
  • Setting max_tokens below 1024 with reasoning enabled returns `content: null` with no error; thinking tokens silently eat the entire token budget before any output is generated
  • The `--mamba_ssm_cache_dtype float32` flag is mandatory for accuracy; omitting it doesn't crash, it just quietly degrades model outputs
  • TRT-LLM was benched and rejected: Nemotron's Mamba-hybrid architecture has limited TRT-LLM Mamba2 support, and without FP8 the speed gains shrink to 10–30% over vLLM, which is not worth the operational complexity
  • 630 tok/s batch throughput on a consumer GPU with BF16 and no quantization is a meaningful data point for anyone sizing inference hardware
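The max_tokens pitfall above can be guarded against on the client side. A minimal sketch, assuming an OpenAI-compatible chat-completions response shape; the helper names and the 1024-token threshold (taken from the post) are illustrative, not part of any vLLM API:

```python
REASONING_MIN_BUDGET = 1024  # threshold reported in the post for Nemotron with reasoning on

def clamp_max_tokens(requested: int, reasoning: bool = True) -> int:
    """Raise the token budget so thinking tokens can't silently consume all of it."""
    if reasoning and requested < REASONING_MIN_BUDGET:
        return REASONING_MIN_BUDGET
    return requested

def require_content(message: dict) -> str:
    """Fail loudly instead of propagating a silent `content: null` downstream."""
    content = message.get("content")
    if content is None:
        raise RuntimeError(
            "content is null: thinking tokens likely exhausted max_tokens; "
            f"retry with a budget >= {REASONING_MIN_BUDGET}"
        )
    return content
```

Example: `clamp_max_tokens(256)` bumps a too-small budget to 1024, and `require_content` turns the silent-null failure mode into an explicit error at the call site.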
// TAGS
vllm · inference · gpu · llm · reasoning · benchmark · open-source

DISCOVERED

2026-03-16

PUBLISHED

2026-03-15

RELEVANCE

7/10

AUTHOR

Impressive_Tower_550