llama.cpp multi-GPU P2P hack hits PCIe wall

// 117d ago · BENCHMARK RESULT

A LocalLLaMA benchmark on a Threadripper 7970X rig (RTX 5090 + dual RTX PRO 4000 Blackwell) shows NVIDIA’s patched 570.148.08 P2P driver can enable ~26.17 GB/s GPU-to-GPU DMA between the two PRO cards, but it does not improve llama.cpp generation throughput for Qwen3-Next-80B-A3B. Generation slightly regressed in split setups, while single-GPU runs remained much faster when models fit in one card’s VRAM.

// ANALYSIS

The benchmark is a sharp reminder that multi-GPU inference is limited by the slowest interconnect hop, not the fastest one.
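
The slowest-hop idea can be sketched as a toy model. This is only an illustration: the 26.17 GB/s figure comes from the benchmark, while `HOST_STAGED_GBPS` is a hypothetical placeholder (the source does not report a host-staged rate), and the function names are invented for this example.

```python
# Toy model: sustained throughput over a chain of hops is capped by the slowest hop.

P2P_GBPS = 26.17          # PRO 4000 <-> PRO 4000 DMA rate reported in the benchmark
HOST_STAGED_GBPS = 10.0   # HYPOTHETICAL placeholder for a hop staged through host memory


def path_bandwidth(hops_gbps):
    """Return the sustained bandwidth (GB/s) of a path made of serial hops."""
    if not hops_gbps:
        raise ValueError("a path needs at least one hop")
    return min(hops_gbps)


def transfer_ms(size_mb, hops_gbps):
    """Time in ms to stream size_mb (decimal MB) over the path; MB / (GB/s) = ms."""
    return size_mb / path_bandwidth(hops_gbps)


# A path that includes the 5090 must take the host-staged hop, so the fast
# P2P leg between the two PRO cards does not raise the end-to-end rate.
mixed_path = path_bandwidth([HOST_STAGED_GBPS, P2P_GBPS])
pro_only_path = path_bandwidth([P2P_GBPS])
```

Under these assumptions `mixed_path` equals the placeholder host-staged rate regardless of how fast the P2P hop is, which is the pattern the benchmark observed.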

– P2P worked only between the two RTX PRO 4000s, not between the RTX 5090 and the PRO cards, so the end-to-end path still bottlenecks on host memory transit.
– In `--split-mode layer`, the pipeline is starved before it reaches the fast P2P leg, so direct DMA gains do not translate into token generation speedups.
– In `--split-mode row`, dual PRO 4000 results were strong, but adding the 5090 introduced a slight generation slowdown, suggesting synchronization and heterogeneous-link overhead.
– The data reinforces a practical rule: use one GPU whenever the model fits, and treat multi-GPU primarily as a VRAM-capacity strategy rather than a guaranteed speed strategy.
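
A comparison along these lines could be scripted roughly as follows. This is a minimal sketch, not the author's actual procedure: it assumes a `llama-bench` binary on `PATH`, the model filename and GPU indices are hypothetical, and `bench_command` is a helper invented for this example. It only builds the command strings; it does not run them.

```python
import shlex

SPLIT_MODES = ("none", "layer", "row")  # values accepted by llama.cpp's -sm / --split-mode


def bench_command(model_path, split_mode, visible_gpus, binary="llama-bench"):
    """Build a shell command that benchmarks one split mode on a chosen GPU set."""
    if split_mode not in SPLIT_MODES:
        raise ValueError(f"unknown split mode: {split_mode}")
    # CUDA_VISIBLE_DEVICES restricts which GPUs the process can see.
    env = "CUDA_VISIBLE_DEVICES=" + ",".join(str(g) for g in visible_gpus)
    # -ngl 99 asks for all layers to be offloaded to the visible GPUs.
    argv = [binary, "-m", model_path, "-ngl", "99", "-sm", split_mode]
    return env + " " + shlex.join(argv)


# HYPOTHETICAL layout: index 0 = RTX 5090, indices 1 and 2 = the two PRO 4000s.
commands = [
    bench_command("model.gguf", "none", [0]),         # single-GPU baseline
    bench_command("model.gguf", "row", [1, 2]),       # P2P-capable pair only
    bench_command("model.gguf", "row", [0, 1, 2]),    # mixed links
    bench_command("model.gguf", "layer", [0, 1, 2]),  # pipeline split
]
```

Running the single-GPU baseline first makes it easier to tell whether a split configuration is buying capacity or actually costing throughput.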

// TAGS

llama-cpp, inference, gpu, benchmark, self-hosted, open-source

DISCOVERED

117d ago

2026-03-17

PUBLISHED

117d ago

2026-03-17

RELEVANCE

8/10

AUTHOR

JB_King1919
