OPEN_SOURCE ↗
REDDIT · 3h ago · BENCHMARK RESULT
vLLM benchmark undercuts PCIe bottleneck fears
A user benchmarked tensor-parallel (TP=2) prefill on 2x RTX 5060 Ti 16GB, plus a third GPU path over a weak PCIe 4.0 x4 link, and saw only 3-4 GB/s peak traffic at 32k context. The result suggests this specific local-LLM workload is more likely VRAM- or compute-limited than PCIe-limited.
// ANALYSIS
This is a useful reality check, but it is still one workload on one motherboard, not proof that PCIe never matters for multi-GPU inference.
- The measured traffic staying at roughly 40-50% of x4 Gen4 suggests there is headroom on the interconnect for this prefill-heavy setup
- Long-context prefill can remain inside PCIe limits when the GPUs themselves are the bottleneck
- The real constraint may shift to chipset lane sharing once a fourth card depends on downstream lanes
- Different serving phases can behave differently, so decode, smaller batches, or other model layouts may produce very different bandwidth pressure
- For local-LLM builders, the practical takeaway is to benchmark the exact stack instead of assuming consumer multi-GPU is automatically PCIe-bound
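The "roughly 40-50% of x4 Gen4" figure can be sanity-checked with link arithmetic. This sketch assumes the standard PCIe 4.0 per-lane rate of 16 GT/s with 128b/130b encoding and ignores protocol overhead, so real usable bandwidth is somewhat lower than the computed peak:

```python
# PCIe 4.0: 16 GT/s per lane, 128b/130b line coding.
# A x4 link therefore peaks near 7.88 GB/s per direction.
lanes = 4
gt_per_s = 16.0           # raw transfer rate per lane (GT/s)
encoding = 128 / 130      # 128b/130b line-code efficiency
peak_gb_s = lanes * gt_per_s / 8 * encoding  # ~7.88 GB/s

# Observed peak traffic from the post: 3-4 GB/s at 32k context.
for measured in (3.0, 4.0):
    util = measured / peak_gb_s
    print(f"{measured:.1f} GB/s -> {util:.0%} of x4 Gen4 peak")
```

Running this gives about 38% and 51% utilization, which matches the card's claim that the link still has headroom for this prefill-heavy workload.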
// TAGS
vllm · llm · inference · gpu · quantization · long-context · benchmark
DISCOVERED
2026-05-06
PUBLISHED
2026-05-06
RELEVANCE
7/10
AUTHOR
ziphnor