llama.cpp multi-GPU P2P hack hits PCIe wall

// BENCHMARK RESULT

A LocalLLaMA benchmark on a Threadripper 7970X rig (one RTX 5090 plus two RTX PRO 4000 Blackwell cards) shows that a P2P-patched NVIDIA 570.148.08 driver enables roughly 26.17 GB/s of GPU-to-GPU DMA between the two PRO cards, but this does not improve llama.cpp generation throughput for Qwen3-Next-80B-A3B. Generation regressed slightly in split setups, while single-GPU runs remained much faster when models fit in one card’s VRAM.

// ANALYSIS

The benchmark is a sharp reminder that multi-GPU inference is limited by the slowest interconnect hop, not the fastest one.
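The slowest-hop point can be illustrated with a toy model. This is only a sketch: the 26.17 GB/s figure is the P2P number reported in the benchmark, while the host-transit bandwidth, the `Hop` class, and both helper functions are hypothetical names and placeholder values, not measurements or llama.cpp internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    gb_per_s: float  # sustained bandwidth of this link


def bottleneck(path):
    """Return the slowest hop; a streamed transfer cannot outrun it."""
    return min(path, key=lambda hop: hop.gb_per_s)


def store_and_forward_ms(path, megabytes):
    """Milliseconds to move a payload hop by hop, one hop at a time.

    With 1 GB = 1000 MB, MB divided by GB/s gives milliseconds directly.
    """
    return sum(megabytes / hop.gb_per_s for hop in path)


# 26.17 GB/s comes from the benchmark summary.
p2p = Hop("PRO 4000 <-> PRO 4000 (P2P)", 26.17)
# ASSUMPTION: placeholder bandwidth for the host-RAM detour, not measured.
via_host = Hop("5090 <-> PRO 4000 (via host RAM)", 10.0)

path = [via_host, p2p]
print("bottleneck:", bottleneck(path).name)
print("time for 8 MB:", round(store_and_forward_ms(path, 8), 3), "ms")
```

Raising the P2P figure in this model barely changes the total as long as the host-transit hop stays slow, which matches the shape of the result described here.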

  • P2P worked only between the two RTX PRO 4000s, not between the RTX 5090 and PRO cards, so the end-to-end path still bottlenecks on host memory transit.
  • In `--split-mode layer`, the pipeline is starved before the fast P2P leg, so direct DMA gains do not translate into token generation speedups.
  • In `--split-mode row`, dual PRO 4000 results were strong, but adding the 5090 introduced a slight generation slowdown, suggesting synchronization and heterogeneous-link overhead.
  • The data reinforces a practical rule: use one GPU whenever possible, and treat multi-GPU primarily as a VRAM-capacity strategy rather than a guaranteed speed strategy.
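The practical rule in the last bullet can be written as a small selection helper. This is a minimal sketch under stated assumptions: `pick_devices` is a hypothetical function, not part of llama.cpp, and the VRAM figures are nominal card capacities (32 GiB and 24 GiB are assumed here), not the usable memory after driver and context overhead.

```python
def pick_devices(model_gib, vram_gib):
    """Pick device indices under a 'one GPU if it fits' rule.

    model_gib: caller's estimate of weights plus KV cache and buffers.
    vram_gib:  VRAM per device, listed in the caller's order of preference.
    """
    # First choice: a single card that holds the whole model.
    for index, capacity in enumerate(vram_gib):
        if capacity >= model_gib:
            return [index]

    # Fallback: add cards in order until the combined capacity is enough.
    chosen, total = [], 0.0
    for index, capacity in enumerate(vram_gib):
        chosen.append(index)
        total += capacity
        if total >= model_gib:
            return chosen

    raise ValueError("model does not fit in the combined VRAM")


# ASSUMPTION: nominal capacities for an RTX 5090 and two RTX PRO 4000 cards.
rig = [32.0, 24.0, 24.0]
print(pick_devices(20.0, rig))  # fits on one card
print(pick_devices(45.0, rig))  # needs a split for capacity
```

Summing capacities is a simplification, since a real split also pays per-device overhead. The helper only encodes the ordering of preferences: single card first, multi-GPU as a capacity fallback.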
// TAGS
llama-cpp · inference · gpu · benchmark · self-hosted · open-source

DISCOVERED: 2026-03-17
PUBLISHED: 2026-03-17
RELEVANCE: 8/10
AUTHOR: JB_King1919