llama.cpp multi-GPU P2P hack hits PCIe wall
REDDIT // 26d ago · BENCHMARK RESULT


A LocalLLaMA benchmark on a Threadripper 7970X rig (RTX 5090 + dual RTX PRO 4000 Blackwell) shows NVIDIA’s patched 570.148.08 P2P driver can enable ~26.17 GB/s GPU-to-GPU DMA between the two PRO cards, but it does not improve llama.cpp generation throughput for Qwen3-Next-80B-A3B. Generation slightly regressed in split setups, while single-GPU runs remained much faster when models fit in one card’s VRAM.
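A rough capacity check illustrates why a model this size forces a multi-GPU split in the first place. The quantization level, overhead factor, and per-card VRAM figure below are illustrative assumptions, not numbers from the benchmark:

```python
# Back-of-envelope VRAM check: can a quantized 80B model fit on one card?
# All numbers here are illustrative assumptions, not benchmark data.

def model_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate VRAM for the weights, with a ~10% overhead factor
    for KV cache and runtime buffers (a crude assumption)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9 * overhead

# Qwen3-Next-80B-A3B has ~80B total parameters; assume a 4-bit quantization.
needed = model_vram_gb(80, 4)
single_card = 32  # assumed VRAM (GB) of a single RTX 5090
print(f"~{needed:.0f} GB needed vs {single_card} GB on one card")
print("fits on one GPU" if needed <= single_card else "must be split across GPUs")
```

Under these assumptions the weights alone overflow a single card, so the split is a capacity necessity regardless of interconnect speed.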

// ANALYSIS

The benchmark is a sharp reminder that multi-GPU inference is limited by the slowest interconnect hop, not the fastest one.

  • P2P worked only between the two RTX PRO 4000s, not between the RTX 5090 and PRO cards, so the end-to-end path still bottlenecks on host memory transit.
  • In `--split-mode layer`, the pipeline is starved before the fast P2P leg, so direct DMA gains do not translate into token generation speedups.
  • In `--split-mode row`, dual PRO 4000 results were strong, but adding the 5090 introduced a slight generation slowdown, suggesting synchronization and heterogeneous-link overhead.
  • The data reinforces a practical rule: use one GPU whenever possible, and treat multi-GPU primarily as a VRAM-capacity strategy rather than a guaranteed speed strategy.
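The "slowest hop" rule above can be sketched numerically. Only the ~26 GB/s P2P figure comes from the benchmark; the host-transit bandwidth is an illustrative assumption:

```python
# Effective transfer rate of a chained pipeline is bounded by its slowest link.
# Bandwidths in GB/s; only the 26.17 P2P figure is from the benchmark,
# the host-memory hop is an illustrative assumption.

def effective_bandwidth(hops_gbps: list[float]) -> float:
    """A chain of transfers can sustain at most its slowest hop."""
    return min(hops_gbps)

p2p_only = [26.17]        # PRO 4000 <-> PRO 4000 direct DMA
via_host = [26.17, 8.0]   # same P2P leg, plus an assumed host-memory hop

print(effective_bandwidth(p2p_only))  # 26.17
print(effective_bandwidth(via_host))  # 8.0 -- the fast P2P leg is wasted
```

This is why enabling a fast P2P leg between two of three cards cannot lift end-to-end throughput when traffic involving the third card still transits host memory.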
// TAGS
llama-cpp · inference · gpu · benchmark · self-hosted · open-source

DISCOVERED

2026-03-17 (26d ago)

PUBLISHED

2026-03-17 (26d ago)

RELEVANCE

8 / 10

AUTHOR

JB_King1919