OPEN_SOURCE
REDDIT · 32d ago · INFRASTRUCTURE
Dual RTX 3090 setup probes PCIe limits
A Reddit post in r/LocalLLaMA asks whether running an LLM across two RTX 3090s—with the second card stuck on a much slower PCIe 2.0 x4 slot—will hurt inference beyond slower model load times. The core issue is whether limited host-side bandwidth meaningfully slows inter-GPU communication for distributed local LLM workloads.
// ANALYSIS
This is a useful local-LLM infrastructure question because PCIe bandwidth usually matters far more for model loading, tensor-parallel sync, and offload-heavy workloads than for basic steady-state token generation.
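To put rough numbers on that intuition, here is a back-of-envelope sketch (assumed figures, not measurements from the post): PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, while PCIe 4.0 runs 16 GT/s per lane with 128b/130b encoding, so an x4 Gen2 slot tops out near 2 GB/s versus roughly 31.5 GB/s for x16 Gen4. The shard size below is an illustrative placeholder.

```python
# Idealized PCIe link bandwidth and weight-load time estimates.
# Encoding overheads: PCIe 2.0 uses 8b/10b, PCIe 3.0/4.0 use 128b/130b.
def pcie_gbps(gen: str, lanes: int) -> float:
    """Effective one-direction bandwidth in GB/s for a PCIe link."""
    per_lane = {
        "2.0": 5.0 * (8 / 10) / 8,      # ~0.5 GB/s per lane
        "3.0": 8.0 * (128 / 130) / 8,   # ~0.985 GB/s per lane
        "4.0": 16.0 * (128 / 130) / 8,  # ~1.97 GB/s per lane
    }
    return per_lane[gen] * lanes

def load_seconds(weights_gb: float, link_gbps: float) -> float:
    """Best-case time to push a weight shard over the link."""
    return weights_gb / link_gbps

slow = pcie_gbps("2.0", 4)    # ~2.0 GB/s
fast = pcie_gbps("4.0", 16)   # ~31.5 GB/s

# A hypothetical ~12 GB shard (half of a quantized model split 50/50):
print(f"x4 Gen2:  {load_seconds(12, slow):.1f} s")   # ~6.0 s
print(f"x16 Gen4: {load_seconds(12, fast):.1f} s")   # ~0.4 s
```

Real loads are slower than these best-case figures (filesystem reads, driver overhead), but the ratio explains why the slow slot is felt most at load time.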
- Community benchmarks on related multi-GPU and eGPU setups suggest load times take the biggest hit first, while decode speed can stay surprisingly close when the workload avoids heavy cross-GPU synchronization
- Tensor parallelism is the danger zone here: frameworks like vLLM explicitly warn that higher tensor parallel sizes can add synchronization overhead, so a weak second slot can become a real bottleneck
- Pipeline-style splits and simple VRAM expansion are generally more forgiving than tightly coupled distributed inference, especially on consumer dual-3090 rigs
- The post is notable as practical operator pain rather than product news: local LLM builders keep discovering that motherboard lane layout matters almost as much as raw VRAM
- For AI developers, the takeaway is that dual x16 is ideal but not strictly required; the right answer depends on the inference engine, sharding method, and whether prompt processing or offload traffic dominates
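The tensor-parallel concern above can also be sized roughly. In 2-way tensor parallelism, each transformer layer typically performs about two all-reduces of the token's hidden state during decode. The model shape below (hidden size 8192, 80 layers, fp16) is an assumed Llama-70B-like example, not taken from the post:

```python
# Rough per-token sync traffic for 2-way tensor parallelism
# (assumed shape: hidden=8192, 80 layers, fp16 activations).
HIDDEN = 8192
LAYERS = 80
BYTES_PER_VALUE = 2          # fp16
ALLREDUCES_PER_LAYER = 2     # typical: after attention and after MLP

per_token_mb = HIDDEN * BYTES_PER_VALUE * ALLREDUCES_PER_LAYER * LAYERS / 1e6
# ~2.6 MB of activations exchanged per generated token

for name, gbps in [("PCIe 2.0 x4", 2.0), ("PCIe 4.0 x16", 31.5)]:
    ms = per_token_mb / (gbps * 1e3) * 1e3   # MB / (MB/s) -> s -> ms
    print(f"{name}: ~{ms:.2f} ms/token of sync traffic")
```

Raw bandwidth alone looks survivable (~1.3 ms/token on the slow link), but the traffic arrives as hundreds of small transfers per token, so per-transfer latency on a host-routed x4 Gen2 link often hurts more than this estimate suggests.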
// TAGS
rtx-3090 · llm · gpu · inference · self-hosted
DISCOVERED
32d ago
2026-03-10
PUBLISHED
35d ago
2026-03-07
RELEVANCE
6/10
AUTHOR
Quiet_Dasy