REDDIT // 6h ago // BENCHMARK RESULT

RTX PRO 6000 benchmarks vLLM throughput

A LocalLLaMA user reports strong Qwen3 27B FP8 throughput on an RTX PRO 6000 Blackwell Workstation card and asks how far a vLLM 0.20.1 nightly build can be pushed for both speed and concurrency. The post reads like an early real-world benchmark for workstation-class inference, not just a tuning question.
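For orientation, a throughput-first engine setup through vLLM's Python API might look like the sketch below. The checkpoint id is a placeholder (the post names the model, not a repo), the values are illustrative rather than the poster's settings, and the kwargs match current stable vLLM; the 0.20.1 nightly may expose a different surface.

    # Hypothetical throughput-first configuration; not the poster's settings.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-27B-FP8",      # placeholder HF id (assumption)
        quantization="fp8",              # redundant if the checkpoint is already FP8-tagged
        gpu_memory_utilization=0.90,     # leave headroom for activations and CUDA graphs
        max_num_seqs=64,                 # ceiling on concurrently running requests
        max_model_len=8192,              # shorter max context frees KV cache for batching
        enable_prefix_caching=True,      # cheap to leave on even when hits are rare
    )

    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Summarize continuous batching in one paragraph."], params)
    print(out[0].outputs[0].text)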

// ANALYSIS

This is a useful signal for anyone building local agent stacks: a 96GB Blackwell workstation GPU is already deep into “serve multiple agents at once” territory, and the real game now is finding the batching and speculative-decoding sweet spot.

  • The reported 763.5 tokens/s prompt throughput and 1320.2 tokens/s generation throughput at 28 running requests work out to roughly 1320.2 / 28 ≈ 47 tokens/s per stream, which suggests the setup is already optimized for aggregate throughput, not single-request latency.
  • GPU KV cache usage at 50.4% and near-zero prefix-cache hits imply the workload is dominated by fresh, heterogeneous prompts, so cache tricks are likely secondary to batching behavior (the metrics sketch after this list shows where those counters come from).
  • The speculative decoding metrics show decent acceptance, but also clear room to tune draft model choice, speculation depth, and context-length tradeoffs; a schematic depth sweep follows below.
  • For agent workloads, this is the right benchmark axis: sustained concurrency and total tokens/sec matter more than headline latency, and the concurrency probe at the end of this section measures exactly that.
  • NVIDIA’s own positioning for the RTX PRO 6000 Blackwell emphasizes 96GB of GDDR7 and AI inference workloads, so this post is a good fit for the hardware’s intended use case.
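On the cache counters: when the model is served with vLLM's OpenAI-compatible server, the quoted KV-cache and prefix-cache figures are exported as Prometheus metrics. A minimal sketch for pulling them, assuming the default port and recent metric names (both vary by version):

    # Poll the server's /metrics endpoint and print the cache counters the post
    # quotes. Metric names follow recent vLLM releases and may differ in 0.20.1.
    import urllib.request

    METRICS_URL = "http://localhost:8000/metrics"  # default vllm serve port (assumption)
    WANTED = (
        "vllm:gpu_cache_usage_perc",   # fraction of KV cache blocks in use
        "vllm:prefix_cache_queries",   # prefix-cache lookups ...
        "vllm:prefix_cache_hits",      # ... and how many of them hit
    )

    with urllib.request.urlopen(METRICS_URL) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(WANTED):
                print(line)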
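On speculation depth: the first knob worth sweeping is the number of drafted tokens per step. A schematic sweep, assuming the speculative_config dict from current vLLM and the model-free ngram drafter; the post does not say which draft setup is actually running, and the key names may differ in the nightly.

    # Schematic depth sweep; in practice restart the process between runs so
    # each engine starts from a clean GPU.
    from vllm import LLM, SamplingParams

    prompts = ["Draft a short status update for an agent run."] * 32
    params = SamplingParams(max_tokens=256)

    for k in (2, 3, 5, 8):                       # speculation depth candidates
        llm = LLM(
            model="Qwen/Qwen3-27B-FP8",          # same placeholder id as above
            speculative_config={
                "method": "ngram",               # drafter needing no extra model; a
                "num_speculative_tokens": k,     # draft-model setup would name one here
                "prompt_lookup_max": 4,
            },
        )
        llm.generate(prompts, params)            # wall-clock this, compare acceptance
        del llm                                  # drop the engine before the next depth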
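And on the benchmark axis itself: a toy probe that fires a fixed number of concurrent requests at the OpenAI-compatible endpoint and reports aggregate generation tokens/s. Endpoint, port, and model name are assumptions; 28 mirrors the post's running-request count.

    # Aggregate-throughput probe against a running vLLM server.
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    async def one(i: int) -> int:
        resp = await client.completions.create(
            model="Qwen/Qwen3-27B-FP8",          # must match the served model name
            prompt=f"Agent task {i}: summarize your last action.",
            max_tokens=256,
        )
        return resp.usage.completion_tokens

    async def main(n: int = 28) -> None:
        start = time.perf_counter()
        tokens = sum(await asyncio.gather(*(one(i) for i in range(n))))
        elapsed = time.perf_counter() - start
        print(f"{n} concurrent requests: {tokens / elapsed:.1f} gen tok/s aggregate")

    asyncio.run(main())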
// TAGS
vllm · llm · inference · gpu · benchmark · agent · open-source

DISCOVERED

6h ago (2026-05-01)

PUBLISHED

7h ago (2026-04-30)

RELEVANCE

8 / 10

AUTHOR

Bowdenzug