OPEN_SOURCE ↗
REDDIT // 4d ago · BENCHMARK RESULT
llama.cpp benchmarks validate cross-vendor eGPU inference
A developer benchmarked llama.cpp's Vulkan backend on a Strix Halo APU paired with an RTX 5070 Ti eGPU via OCuLink, proving cross-vendor tensor splitting works seamlessly. The tests debunk the myth that OCuLink's PCIe 4.0 x4 bandwidth bottlenecks local LLM token generation.
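A setup like the one described would be driven through llama.cpp's standard multi-GPU flags. The invocation below is an illustrative sketch (model path and split ratio are assumptions, not the poster's exact command):

```shell
# Sketch of a cross-vendor split with llama.cpp's Vulkan build.
# -ngl 99             offload all layers to GPU devices
# --split-mode layer  distribute whole transformer layers across devices
# --tensor-split 3,1  put ~3/4 of the layers on one device, ~1/4 on the other
./llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 3,1
```

With the Vulkan backend, both the AMD APU and the NVIDIA eGPU show up as ordinary Vulkan devices, which is what makes the cross-vendor split possible without CUDA/ROCm interop.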
// ANALYSIS
This deep dive into heterogeneous inference confirms that memory bandwidth, not PCIe constraints, dictates local LLM performance.
- Vulkan abstracts the hardware layer, letting AMD and NVIDIA architectures be combined stably with only a 5-10% performance penalty versus native CUDA or ROCm.
- The OCuLink link runs at under 1% of its capacity during active token generation, since only small activation tensors pass between GPUs.
- Offloading layers to slower system memory causes a non-linear performance drop governed by Amdahl's Law, as the fast eGPU stalls waiting on the APU.
- The results point toward massive, high-capacity local inference rigs that pool cheap unified APU memory with fast dedicated eGPU VRAM.
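The bandwidth and offloading claims above are easy to sanity-check with back-of-the-envelope numbers. All figures below are illustrative assumptions, not measurements from the post:

```python
# Activation traffic over OCuLink during token generation (assumed numbers).
hidden_dim = 8192          # hidden size of a large model (assumption)
bytes_per_elem = 2         # fp16 activations
split_boundaries = 1       # one APU <-> eGPU handoff per token
tokens_per_sec = 30        # assumed generation speed

bytes_per_token = hidden_dim * bytes_per_elem * split_boundaries
traffic_bps = bytes_per_token * tokens_per_sec
oculink_bps = 8e9          # PCIe 4.0 x4, roughly 8 GB/s usable
utilization = traffic_bps / oculink_bps   # far below 1%

def pipeline_tps(fast_tps, slow_tps, frac_slow):
    """Amdahl-style serial pipeline: per-token times add, so the
    slow portion drags throughput down non-linearly."""
    t = frac_slow / slow_tps + (1 - frac_slow) / fast_tps
    return 1 / t

print(f"{bytes_per_token} B/token, {traffic_bps/1e6:.2f} MB/s, "
      f"{utilization:.6%} of link capacity")
print(f"20% of layers at 1/5 speed -> {pipeline_tps(50, 10, 0.2):.1f} tok/s")
```

Note how offloading 20% of layers to a path that is 5x slower costs far more than 20% of throughput: the serial term dominates, which is the Amdahl's Law stall the analysis describes.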
// TAGS
llama-cpp · llm · inference · gpu · benchmark
DISCOVERED
2026-04-08
PUBLISHED
2026-04-07
RELEVANCE
8/10
AUTHOR
xspider2000