llama.cpp benchmarks validate cross-vendor eGPU inference
A developer benchmarked llama.cpp's Vulkan backend on a Strix Halo APU paired with an RTX 5070 Ti eGPU via OCuLink, proving cross-vendor tensor splitting works seamlessly. The tests debunk the myth that OCuLink's PCIe 4.0 x4 bandwidth bottlenecks local LLM token generation.
This deep dive into heterogeneous inference confirms that memory bandwidth, not PCIe link bandwidth, dictates local LLM token-generation speed.
- Vulkan abstracts the hardware layer, combining AMD and NVIDIA GPUs stably with only a 5-10% performance penalty compared to native CUDA or ROCm.
- Active token generation uses less than 1% of the OCuLink link's bandwidth, because only tiny activation tensors are passed between GPUs.
- Offloading layers to slower system memory causes a non-linear performance drop governed by Amdahl's Law: the fast eGPU stalls while it waits for the APU to finish its share of the layers.
- The results pave the way for high-capacity local inference rigs that pool cheap unified APU memory with fast dedicated eGPU VRAM.
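The two quantitative claims above (the non-linear slowdown from offloading and the near-idle link) can be illustrated with a back-of-envelope model. This is a minimal sketch, not the benchmark's method: it assumes token generation is purely memory-bandwidth bound, that each token reads every weight once, and that the two devices process their layers one after the other. The function names and every number used with them (model size, hidden size, memory bandwidths, link speed) are illustrative assumptions, not figures from the benchmark.

```python
def tokens_per_second(model_gb, fast_fraction, fast_bw_gbs, slow_bw_gbs):
    """Estimate decode speed for a model split by layers across two devices.

    Assumption: each generated token reads all weights once, and the two
    devices work sequentially, so their per-token times add up.
    """
    fast_time = model_gb * fast_fraction / fast_bw_gbs          # seconds on the eGPU
    slow_time = model_gb * (1.0 - fast_fraction) / slow_bw_gbs  # seconds on the APU
    return 1.0 / (fast_time + slow_time)


def link_utilization(hidden_size, bytes_per_value, tokens_per_s, link_gbs):
    """Fraction of link bandwidth used when one activation vector of
    `hidden_size` values crosses the link per generated token."""
    bytes_per_s = hidden_size * bytes_per_value * tokens_per_s
    return bytes_per_s / (link_gbs * 1e9)


if __name__ == "__main__":
    # Hypothetical setup: 40 GB of weights, 896 GB/s eGPU VRAM,
    # 256 GB/s APU memory, roughly 8 GB/s for a PCIe 4.0 x4 link.
    for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
        tps = tokens_per_second(40, fraction, 896, 256)
        util = link_utilization(8192, 2, tps, 8)
        print(f"eGPU share {fraction:4.0%}: {tps:5.1f} tok/s, link use {util:.6%}")
```

Under these assumed numbers, a 50/50 split yields about 10 tokens/s, well below the midpoint of the two single-device speeds, because the slower device dominates the per-token time. The link utilization stays several orders of magnitude below 1%, which is consistent with the claim that only small activation tensors cross between GPUs.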
DISCOVERED: 2026-04-08
PUBLISHED: 2026-04-07
AUTHOR: xspider2000