Dual RTX 3090 setup probes PCIe limits
OPEN_SOURCE
REDDIT // 32d ago · INFRASTRUCTURE

A Reddit post in r/LocalLLaMA asks whether running an LLM across two RTX 3090s—with the second card stuck on a much slower PCIe 2.0 x4 slot—will hurt inference beyond slower model load times. The core issue is whether limited host-side bandwidth meaningfully slows inter-GPU communication for distributed local LLM workloads.
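For rough intuition on the gap: PCIe 2.0 x4 tops out around 2 GB/s of usable one-direction bandwidth, versus roughly 15.8 GB/s for PCIe 3.0 x16, and that difference shows up first as weight-load time. A minimal back-of-envelope sketch (the per-shard size of 12 GB is an illustrative assumption, not a figure from the post):

```python
# Back-of-envelope PCIe bandwidth and model-load-time estimate.
# Per-lane raw rates (GT/s) adjusted for each generation's line encoding.
PCIE_LANE_GBPS = {
    2.0: 5.0 * 8 / 10,     # PCIe 2.0: 5 GT/s, 8b/10b encoding -> 4 Gbit/s/lane
    3.0: 8.0 * 128 / 130,  # PCIe 3.0: 8 GT/s, 128b/130b -> ~7.88 Gbit/s/lane
    4.0: 16.0 * 128 / 130, # PCIe 4.0: 16 GT/s, 128b/130b -> ~15.75 Gbit/s/lane
}

def link_gb_per_s(gen: float, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (ignores protocol overhead)."""
    return PCIE_LANE_GBPS[gen] * lanes / 8  # bits -> bytes

def load_seconds(shard_gb: float, gen: float, lanes: int) -> float:
    """Best-case time to push one weight shard across the link."""
    return shard_gb / link_gb_per_s(gen, lanes)

if __name__ == "__main__":
    shard = 12.0  # hypothetical ~12 GB of weights per 3090
    print(f"PCIe 2.0 x4 : {link_gb_per_s(2.0, 4):5.2f} GB/s, "
          f"load ~{load_seconds(shard, 2.0, 4):.1f} s")
    print(f"PCIe 3.0 x16: {link_gb_per_s(3.0, 16):5.2f} GB/s, "
          f"load ~{load_seconds(shard, 3.0, 16):.1f} s")
```

So the slow slot costs seconds at load time; whether it also costs tokens per second depends on how much cross-GPU traffic the inference setup generates, which is what the analysis below turns on.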

// ANALYSIS

This is a useful local-LLM infrastructure question because PCIe bandwidth usually matters far more for model loading, tensor-parallel sync, and offload-heavy workloads than for basic steady-state token generation.

  • Community benchmarks on related multi-GPU and eGPU setups suggest load times take the biggest hit first, while decode speed can stay surprisingly close when the workload avoids heavy cross-GPU synchronization
  • Tensor parallelism is the danger zone here: frameworks like vLLM explicitly warn that higher tensor parallel sizes can add synchronization overhead, so a weak second slot can become a real bottleneck
  • Pipeline-style splits and simple VRAM expansion are generally more forgiving than tightly coupled distributed inference, especially on consumer dual-3090 rigs
  • The post is notable as practical operator pain rather than product news: local LLM builders keep discovering that motherboard lane layout matters almost as much as raw VRAM
  • For AI developers, the takeaway is that dual x16 is ideal but not strictly required; the right answer depends on the inference engine, sharding method, and whether prompt processing or offload traffic dominates
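To see why tensor parallelism is the sensitive case, one can estimate the all-reduce traffic a token generates. A Megatron-style split performs two all-reduces per transformer layer, and a ring all-reduce moves about 2*(n-1)/n times the message size per participant. A sketch under assumed model dimensions (80 layers, hidden size 8192, fp16 — roughly a 70B-class shape, not numbers from the thread):

```python
# Rough per-token inter-GPU traffic for Megatron-style tensor parallelism,
# and the communication time that traffic implies over a given link.

def allreduce_bytes_per_token(layers: int, hidden: int,
                              dtype_bytes: int, n_gpus: int) -> float:
    """Bytes each GPU sends+receives per decoded token (batch size 1).

    Two all-reduces per layer (attention output + MLP output); a ring
    all-reduce moves 2*(n-1)/n of the message size per participant.
    """
    msg = hidden * dtype_bytes            # one activation vector
    ring_factor = 2 * (n_gpus - 1) / n_gpus
    return layers * 2 * msg * ring_factor

def comm_ms_per_token(traffic_bytes: float, link_gb_per_s: float) -> float:
    """Milliseconds spent moving that traffic at the given link speed."""
    return traffic_bytes / (link_gb_per_s * 1e9) * 1e3

if __name__ == "__main__":
    # Illustrative 70B-class shape: 80 layers, hidden 8192, fp16, 2 GPUs.
    traffic = allreduce_bytes_per_token(80, 8192, 2, 2)
    print(f"traffic/token: {traffic / 2**20:.1f} MiB")
    print(f"PCIe 2.0 x4 (~2 GB/s): {comm_ms_per_token(traffic, 2.0):.2f} ms/token")
```

Note that at batch size 1 the raw byte count looks modest; the bigger cost on a narrow link is usually the latency of the ~160 synchronization points per token, which pure bandwidth arithmetic understates — consistent with the vLLM warning about tensor-parallel overhead cited above.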
// TAGS
rtx-3090 · llm · gpu · inference · self-hosted

DISCOVERED

32d ago (2026-03-10)

PUBLISHED

35d ago (2026-03-07)

RELEVANCE

6 / 10

AUTHOR

Quiet_Dasy