vLLM throughput question targets RTX PRO 6000
REDDIT · 5h ago · INFRASTRUCTURE


A Reddit user asks what saturated vLLM throughput looks like on 4x or 8x RTX PRO 6000 GPUs for Gemma 4 31B or 26B-A4B at 8-bit quantization. The use case is a highly concurrent translation workload: the user estimates 10k+ in-flight requests and wants to judge whether rented GPUs beat API costs.

// ANALYSIS

This is less a benchmark post than a capacity-planning sanity check: the real answer depends on prompt length, output length, batching behavior, quantization format, and whether the deployment uses replication or tensor parallelism.
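That dependence can be made concrete with a back-of-envelope sizing calculation. All numbers below (per-GPU decode throughput, output length, latency budget) are illustrative assumptions for the sketch, not measurements from the thread:

```python
import math

# Rough capacity-planning sketch: how many GPUs might a concurrent
# translation workload need, given an assumed per-GPU throughput?
# Every number here is an illustrative assumption, not a benchmark.

def gpus_needed(concurrent_requests, avg_output_tokens,
                target_latency_s, per_gpu_decode_tok_s):
    # Aggregate decode rate needed to finish every in-flight request
    # within the latency budget, rounded up to whole GPUs.
    required_tok_s = concurrent_requests * avg_output_tokens / target_latency_s
    return math.ceil(required_tok_s / per_gpu_decode_tok_s)

# Example: 10k in-flight requests, ~300 output tokens each, a 60 s
# budget, and an assumed 3,000 tok/s per card (the "few thousand
# tok/s" figure a commenter suggests).
print(gpus_needed(10_000, 300, 60, 3_000))  # 17 under these assumptions
```

Changing any one input (longer prompts, tighter latency, a different quantization format) shifts the answer substantially, which is why the thread's replies hedge so heavily.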

  • The thread’s replies suggest saturation is achievable with the right serving setup, but the quantization format matters: FP8 or AWQ may outperform NVFP4 in practice.
  • One commenter shares single-GPU RTX 6000 and B200 numbers that imply "a few thousand tok/s" per card is plausible for Gemma-class workloads, but only under specific concurrency and sequence-length conditions.
  • Another practical point is architectural: running multiple independent vLLM instances may scale better than one giant tensor-split instance, especially if the goal is aggregate throughput rather than one monolithic model shard.
  • The post also hints at a common local-inference trap: if the workload is temporary or bursty, larger datacenter GPUs may be easier to justify than workstation-class cards once software overhead and operator time are included.
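The rent-vs-API question behind the post reduces to a break-even throughput. The prices here are hypothetical placeholders chosen for the sketch, not quotes for any provider:

```python
# Break-even sketch: at what sustained throughput does a rented GPU
# undercut an API? All prices are hypothetical placeholders.

def breakeven_tok_s(gpu_cost_per_hour, api_cost_per_mtok):
    # Tokens the GPU must sustain per second so its hourly rental cost
    # equals what the API would charge for the same token volume.
    tokens_per_hour = gpu_cost_per_hour / api_cost_per_mtok * 1_000_000
    return tokens_per_hour / 3600

# Example: a $2/hr rental vs $0.50 per million output tokens.
print(round(breakeven_tok_s(2.0, 0.50)))  # 1111 tok/s to break even
```

Below the break-even rate, idle or underutilized rented GPUs cost more than the API; above it, self-hosting wins on raw token price, though operator time and software overhead (the bursty-workload trap noted above) still count against it.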
// TAGS
vllm · inference · gpu · llm · open-source · self-hosted

DISCOVERED

5h ago

2026-04-24

PUBLISHED

7h ago

2026-04-24

RELEVANCE

8/10

AUTHOR

AdventurousFly4909