NVLink boosts dual-3090 Qwen throughput
REDDIT // 32d ago // BENCHMARK RESULT

A LocalLLaMA benchmark shows two RTX 3090s linked with NVLink materially outperform several non-NVLink topologies when running Qwen3.5 27B FP8. The posted results show faster single-stream generation, much higher aggregate throughput under concurrency, and sharply better prefill/TTFT, suggesting interconnect bandwidth still matters for serious multi-GPU local inference.

// ANALYSIS

This is a useful reality check for anyone assuming consumer multi-GPU inference is compute-bound first and topology-bound second.

  • The NVLink setup hit 79.4 tok/s single-stream versus roughly 70-74 tok/s without NVLink
  • Under 20 concurrent generations, throughput jumped to 693.2 tok/s versus about 493-542 tok/s on the non-NVLink layouts
  • Prefill improved the most, rising to 2,181 tok/s with about 7.1s TTFT versus roughly 1,395-1,677 tok/s and 9.2-11.0s TTFT without NVLink
  • The post’s PLX note is the real takeaway: on consumer cards, PCIe topology and peer-to-peer limits can erase a lot of the benefit of adding a second GPU
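
To put the figures above on a common scale, here is a minimal sketch that converts them into percentage gains. The numbers are the ones quoted in this summary, with the approximate low and high ends of each non-NVLink range used as bounds. The names `REPORTED`, `gain_range`, and `ttft_reduction` are illustrative and do not come from the original post.

```python
# Figures as quoted in the summary: (NVLink, non-NVLink low, non-NVLink high).
# The non-NVLink values are approximate range endpoints, not exact measurements.
REPORTED = {
    "single_stream_tok_s": (79.4, 70.0, 74.0),
    "concurrent_20_tok_s": (693.2, 493.0, 542.0),
    "prefill_tok_s": (2181.0, 1395.0, 1677.0),
}

# TTFT in seconds: (NVLink, best non-NVLink, worst non-NVLink).
REPORTED_TTFT_S = (7.1, 9.2, 11.0)


def gain_range(nvlink, low, high):
    """Return (min, max) percent throughput gain of NVLink over the non-NVLink range."""
    return (
        round((nvlink / high - 1) * 100, 1),  # versus the fastest non-NVLink layout
        round((nvlink / low - 1) * 100, 1),   # versus the slowest non-NVLink layout
    )


def ttft_reduction(nvlink, best, worst):
    """Return (min, max) percent reduction in time to first token."""
    return (
        round((1 - nvlink / best) * 100, 1),
        round((1 - nvlink / worst) * 100, 1),
    )


if __name__ == "__main__":
    for name, (nvlink, low, high) in REPORTED.items():
        lo, hi = gain_range(nvlink, low, high)
        print(f"{name}: +{lo}% to +{hi}%")
    lo, hi = ttft_reduction(*REPORTED_TTFT_S)
    print(f"ttft_s: -{lo}% to -{hi}%")
```

On these inputs the gain is roughly 7 to 13 percent for single-stream generation, 28 to 41 percent at 20 concurrent generations, and 30 to 56 percent for prefill. The benefit is therefore largest in prefill and under concurrency, and smallest in single-stream decode.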
// TAGS
nvlink · gpu · inference · benchmark · llm

DISCOVERED

32d ago

2026-03-11

PUBLISHED

32d ago

2026-03-11

RELEVANCE

8/10

AUTHOR

Conscious_Cut_6144