REDDIT // 32d ago // BENCHMARK RESULT
NVLink boosts dual-3090 Qwen throughput
A LocalLLaMA benchmark shows two RTX 3090s linked with NVLink materially outperform several non-NVLink topologies when running Qwen3.5 27B FP8. The posted results show faster single-stream generation, much higher aggregate throughput under concurrency, and sharply better prefill/TTFT, suggesting interconnect bandwidth still matters for serious multi-GPU local inference.
// ANALYSIS
This is a useful reality check for anyone assuming consumer multi-GPU inference is compute-bound first and topology-bound second.
- The NVLink setup hit 79.4 tok/s single-stream versus roughly 70-74 tok/s without NVLink
- Under 20 concurrent generations, throughput jumped to 693.2 tok/s versus about 493-542 tok/s on the non-NVLink layouts
- Prefill improved the most, rising to 2,181 tok/s with about 7.1s TTFT versus roughly 1,395-1,677 tok/s and 9.2-11.0s TTFT without NVLink
- The post’s PLX note is the real takeaway: on consumer cards, PCIe topology and peer-to-peer limits can erase a lot of the benefit of adding a second GPU
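To put the figures above on a common scale, here is a minimal Python sketch that converts them into percentage gains. The numbers are the ones quoted in this summary. The names `RESULTS`, `gain_range`, and `summarize` are illustrative and do not come from the original post, and the non-NVLink values are treated as approximate low/high ranges rather than exact per-topology results.

```python
# Figures quoted in the summary above. Each entry is
# (NVLink result, (lowest non-NVLink result, highest non-NVLink result)).
# The non-NVLink ranges are approximate, as stated in the summary.
RESULTS = {
    "single-stream tok/s": (79.4, (70.0, 74.0)),
    "20-concurrent tok/s": (693.2, (493.0, 542.0)),
    "prefill tok/s": (2181.0, (1395.0, 1677.0)),
}


def gain_range(nvlink, baseline_range):
    """Return (smallest, largest) percentage gain of NVLink over the baselines."""
    low, high = baseline_range
    # The smallest gain is measured against the best baseline,
    # the largest gain against the worst baseline.
    return ((nvlink / high - 1) * 100, (nvlink / low - 1) * 100)


def summarize(results=RESULTS):
    """Map each metric name to its (smallest, largest) percentage gain."""
    return {name: gain_range(nv, base) for name, (nv, base) in results.items()}


if __name__ == "__main__":
    for name, (smallest, largest) in summarize().items():
        print(f"{name}: +{smallest:.1f}% to +{largest:.1f}%")
```

On these inputs the gains work out to roughly 7 to 13 percent for single-stream generation, 28 to 41 percent under 20 concurrent generations, and 30 to 56 percent for prefill, which matches the observation that prefill benefits the most.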
// TAGS
nvlink, gpu, inference, benchmark, llm
DISCOVERED
32d ago
2026-03-11
PUBLISHED
32d ago
2026-03-11
RELEVANCE
8/10
AUTHOR
Conscious_Cut_6144