OPEN_SOURCE
REDDIT · 26d ago · BENCHMARK RESULT
llama.cpp multi-GPU P2P hack hits PCIe wall
A LocalLLaMA benchmark on a Threadripper 7970X rig (RTX 5090 + dual RTX PRO 4000 Blackwell) shows NVIDIA’s patched 570.148.08 P2P driver can enable ~26.17 GB/s GPU-to-GPU DMA between the two PRO cards, but it does not improve llama.cpp generation throughput for Qwen3-Next-80B-A3B. Generation slightly regressed in split setups, while single-GPU runs remained much faster when models fit in one card’s VRAM.
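A back-of-envelope calculation makes the null result plausible. The sketch below uses the reported ~26.17 GB/s P2P figure; the hidden size, host-path bandwidth, and token latency are illustrative assumptions (not numbers from the post) chosen to show the order of magnitude: per-token activation transfers are microseconds against a token latency of milliseconds, so raising one hop's bandwidth cannot move generation throughput.

```python
# Why faster GPU-to-GPU DMA barely moves token latency.
# All values except P2P_BW (26.17 GB/s, from the post) are assumptions.

HIDDEN_SIZE = 4096          # assumed activation width (elements)
BYTES_PER_ELEM = 2          # fp16
P2P_BW = 26.17e9            # B/s, reported P2P bandwidth
HOST_BW = 8.0e9             # B/s, assumed host-memory-transit path
TOKEN_LATENCY = 0.02        # s, assumed ~50 tok/s generation

def hop_time(bandwidth_bps: float) -> float:
    """Time to move one token's activations across one pipeline hop."""
    return HIDDEN_SIZE * BYTES_PER_ELEM / bandwidth_bps

t_p2p = hop_time(P2P_BW)    # ~0.3 microseconds
t_host = hop_time(HOST_BW)  # ~1 microsecond

# Even the slow hop is roughly four orders of magnitude below per-token
# latency, so per-token generation is dominated by compute, kernel launch,
# and synchronization costs, not by inter-GPU bandwidth.
print(f"P2P hop:  {t_p2p * 1e6:.2f} us")
print(f"Host hop: {t_host * 1e6:.2f} us")
print(f"Token:    {TOKEN_LATENCY * 1e3:.0f} ms")
```

Under these assumptions, even the slower host-transit hop costs well under 0.01% of a token's latency, which is consistent with the benchmark seeing no generation speedup from P2P.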
// ANALYSIS
The benchmark is a sharp reminder that multi-GPU inference is limited by the slowest interconnect hop, not the fastest one.
- P2P worked only between the two RTX PRO 4000s, not between the RTX 5090 and the PRO cards, so the end-to-end path still bottlenecks on host-memory transit.
- In `--split-mode layer`, the pipeline is starved before the fast P2P leg, so direct-DMA gains do not translate into token-generation speedups.
- In `--split-mode row`, the dual PRO 4000 results were strong, but adding the 5090 introduced a slight generation slowdown, suggesting synchronization and heterogeneous-link overhead.
- The data reinforces a practical rule: use one GPU whenever possible, and treat multi-GPU primarily as a VRAM-capacity strategy rather than a guaranteed speed strategy.
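The split-mode comparisons above can be reproduced with llama.cpp's `llama-bench` tool, which exposes the `-sm` (`--split-mode`) flag directly. The model filename below is a hypothetical quantized GGUF, and the `CUDA_VISIBLE_DEVICES` indices assume device 0 is the 5090 and devices 1–2 are the PRO 4000s; adjust both to your setup.

```shell
MODEL=Qwen3-Next-80B-A3B.gguf   # hypothetical filename; use your local GGUF

# Single-GPU baseline (the fits-in-VRAM case): pin to one card, no split.
CUDA_VISIBLE_DEVICES=0 ./llama-bench -m "$MODEL" -ngl 99 -sm none

# Layer split across all three cards.
./llama-bench -m "$MODEL" -ngl 99 -sm layer

# Row split across only the two PRO 4000s (the pair with working P2P).
CUDA_VISIBLE_DEVICES=1,2 ./llama-bench -m "$MODEL" -ngl 99 -sm row
```

Comparing the `tg` (token-generation) rows across these runs is what surfaces the pattern in the post: the single-GPU run wins when the model fits, and row split over the P2P-linked pair beats the heterogeneous three-card mix.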
// TAGS
llama-cpp · inference · gpu · benchmark · self-hosted · open-source
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
RELEVANCE
8/10
AUTHOR
JB_King1919