REDDIT // 3h ago · BENCHMARK RESULT

vLLM benchmark undercuts PCIe bottleneck fears

A user benchmarked TP=2 prefill on 2x RTX 5060 Ti 16GB, plus a third GPU path via a narrow PCIe 4.0 x4 link, and saw only 3-4 GB/s of peak PCIe traffic at 32k context. The result suggests this specific local-LLM workload is more likely VRAM- or compute-limited than PCIe-limited.

// ANALYSIS

This is a useful reality check, but it is still one workload on one motherboard, not proof that PCIe never matters for multi-GPU inference.

– The measured traffic stays at roughly 40-50% of what a PCIe 4.0 x4 link can carry, which suggests the interconnect has headroom for this prefill-heavy setup.
– Long-context prefill can stay within PCIe limits when the GPUs themselves are the bottleneck.
– The real constraint may shift to chipset lane sharing once a fourth card depends on downstream lanes.
– Serving phases behave differently, so decode, smaller batches, or other model layouts may produce very different bandwidth pressure.
– For local-LLM builders, the practical takeaway is to benchmark the exact stack instead of assuming consumer multi-GPU is automatically PCIe-bound.
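
To see where the roughly 40-50% figure comes from, the measured 3-4 GB/s can be compared against the theoretical one-direction bandwidth of a PCIe 4.0 x4 link. The sketch below is a back-of-the-envelope calculation, not part of the original benchmark. It assumes the standard 16 GT/s per-lane rate and 128b/130b encoding for Gen4 and ignores packet and protocol overhead, so real usable bandwidth is somewhat lower. The helper names are hypothetical.

```python
# Back-of-the-envelope PCIe link utilization (hypothetical helper names).

# Per-lane signalling rate in GT/s and line-encoding efficiency per generation.
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def link_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[gen]
    # Each transfer carries one bit per lane; divide by 8 to get bytes.
    return gt_per_s * efficiency / 8 * lanes


def utilization(measured_gb_s: float, gen: int, lanes: int) -> float:
    """Fraction of the theoretical link bandwidth that the measured traffic uses."""
    return measured_gb_s / link_bandwidth_gb_s(gen, lanes)


if __name__ == "__main__":
    peak = link_bandwidth_gb_s(4, 4)
    print(f"PCIe 4.0 x4 theoretical: {peak:.2f} GB/s")
    for measured in (3.0, 4.0):  # the 3-4 GB/s range reported in the post
        print(f"{measured:.1f} GB/s -> {utilization(measured, 4, 4):.0%} of link")
```

Under these assumptions the link tops out at about 7.88 GB/s per direction, so 3 GB/s is about 38% and 4 GB/s about 51% of the theoretical figure.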

// TAGS

vllm, llm, inference, gpu, quantization, long-context, benchmark

DISCOVERED

3h ago

2026-05-06

PUBLISHED

3h ago

2026-05-06

RELEVANCE

7/10

AUTHOR

ziphnor