OPEN_SOURCE ↗
REDDIT // 5h ago · INFRASTRUCTURE
vLLM throughput question targets RTX PRO 6000
A Reddit user asks what saturated vLLM throughput looks like on 4x or 8x RTX PRO 6000 GPUs for Gemma 4 31B or 26B-A4B at 8-bit quantization. The use case is a highly concurrent translation workload, with the user estimating 10k+ in-flight requests and trying to judge whether rented GPUs beat API costs.
// ANALYSIS
This is less a benchmark post than a capacity-planning sanity check: the real answer depends on prompt length, output length, batching behavior, quantization format, and whether the deployment uses replication or tensor parallelism.
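The capacity-planning arithmetic can be sketched directly. The functions and every number below are illustrative assumptions (request count, token counts, prices, and latency budget are not from the thread); the point is the shape of the calculation, not the figures.

```python
# Back-of-envelope capacity math for a highly concurrent serving workload.
# All constants are hypothetical placeholders, not measurements.

def required_throughput(concurrent_requests: int,
                        output_tokens: int,
                        latency_budget_s: float) -> float:
    """Aggregate decode tokens/s needed so that each in-flight request
    finishes within the latency budget at the given concurrency."""
    return concurrent_requests * output_tokens / latency_budget_s

def gpu_vs_api_cost(tokens_per_day: float,
                    gpu_hourly_usd: float,
                    num_gpus: int,
                    api_usd_per_mtok: float) -> tuple[float, float]:
    """Daily cost of a rented GPU pool vs. a pay-per-token API."""
    gpu_cost = gpu_hourly_usd * num_gpus * 24
    api_cost = tokens_per_day / 1e6 * api_usd_per_mtok
    return gpu_cost, api_cost

# e.g. 10k in-flight requests, ~400 output tokens each, 60 s budget:
need = required_throughput(10_000, 400, 60.0)  # ~66,700 tok/s aggregate
```

Even this crude model makes the trade-off concrete: the GPU side is a fixed hourly cost regardless of utilization, so the comparison hinges on keeping the cards saturated.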
- The thread's replies suggest a high saturated-throughput ceiling is reachable with the right serving setup, but the quantization choice matters: FP8 or AWQ may outperform NVFP4 in practice.
- One commenter shares single-GPU RTX 6000 and B200 numbers implying "a few thousand tok/s" per card is plausible for Gemma-class workloads, but only under specific concurrency and sequence-length conditions.
- Another practical point is architectural: running multiple independent vLLM instances may scale better than one giant tensor-parallel instance, especially if the goal is aggregate throughput rather than serving one monolithic model shard.
- The post also hints at a common local-inference trap: if the workload is temporary or bursty, larger datacenter GPUs may be easier to justify than workstation-class cards once software overhead and operator time are included.
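The replication-versus-tensor-parallelism point from the bullets above can be sketched with vLLM's CLI. This is a minimal deployment sketch, not a recommendation: the model ID is a placeholder, and a real setup would put a load balancer in front of the replicas.

```shell
MODEL="your/model-id"  # placeholder; substitute the actual quantized model

# Option A: one tensor-parallel instance spanning four GPUs.
# Necessary when the model does not fit on a single card.
vllm serve "$MODEL" --tensor-parallel-size 4 --port 8000

# Option B: four independent single-GPU replicas behind a load balancer.
# If the quantized model fits on one card, replication avoids inter-GPU
# communication and often wins on aggregate throughput.
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i vllm serve "$MODEL" --port $((8000 + i)) &
done
```

The trade-off is the one the commenter describes: tensor parallelism buys capacity for a single large shard at the cost of per-token communication, while replication multiplies throughput linearly as long as each replica fits in one GPU's memory.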
// TAGS
vllm · inference · gpu · llm · open-source · self-hosted
DISCOVERED
5h ago
2026-04-24
PUBLISHED
7h ago
2026-04-24
RELEVANCE
8/10
AUTHOR
AdventurousFly4909