OPEN_SOURCE
REDDIT // 4h ago · INFRASTRUCTURE
llama.cpp users debate MoE bottlenecks
A LocalLLaMA thread digs into whether CPU RAM bandwidth or PCIe bandwidth dominates hybrid MoE inference when llama.cpp and ik_llama.cpp offload experts across CPU memory and multiple GPUs. The setup compares a higher-core EPYC system with more aggregate PCIe bandwidth against an Ice Lake Xeon system with AVX-512 and roughly double memory bandwidth.
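The bandwidth comparison can be made concrete with a standard back-of-envelope bound: per generated token, a memory-bound decode must stream roughly the active weight bytes through RAM, so the ceiling is bandwidth divided by active bytes. A minimal sketch, with illustrative numbers only (the parameter count, quantization size, and bandwidth figures are assumptions, not measurements from the thread):

```python
# Back-of-envelope ceiling on token generation when MoE expert weights
# stream from CPU RAM. All numbers below are illustrative assumptions.

def decode_tok_s(bandwidth_gb_s: float, active_params_b: float,
                 bytes_per_param: float) -> float:
    """Upper bound on tokens/s: bandwidth / (active weight bytes per token)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical MoE with 37B active params at ~Q4 (~0.56 bytes/param with scales),
# comparing an assumed ~200 GB/s EPYC config against an assumed ~400 GB/s Xeon.
for name, bw in [("EPYC  (~200 GB/s assumed)", 200),
                 ("Xeon  (~400 GB/s assumed)", 400)]:
    print(f"{name}: ~{decode_tok_s(bw, 37, 0.56):.1f} tok/s ceiling")
```

Doubling bandwidth doubles this ceiling, which is why the Xeon looks attractive on paper; the thread's point is that prompt processing and PCIe transfers do not obey this bound, so the ceiling alone cannot pick a winner.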
// ANALYSIS
The useful takeaway is that there is no single bottleneck: prompt processing, token generation, expert placement, batch size, cache behavior, and PCIe topology all change the answer.
- CPU-resident MoE experts can make RAM bandwidth matter, especially when selected experts are computed on CPU rather than copied to GPU.
- When prompt processing offloads CPU-held weights back to GPU, PCIe bandwidth and main-GPU placement can become the painful limit.
- ik_llama.cpp is relevant because it explicitly targets CPU/CUDA hybrid performance, fused MoE operations, tensor overrides, and graph split modes.
- The Xeon upgrade may help CPU-side MoE and AVX-512 kernels, but fewer cores and weaker PCIe topology could hurt multi-GPU pipeline or transfer-heavy runs.
- This is benchmark territory: `llama-bench` or sweep-style testing with the actual models, batch sizes, and split modes will beat theoretical bandwidth math.
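Since `llama-bench` reports prompt-processing (pp) and token-generation (tg) throughput separately per configuration (and can emit machine-readable output with `-o json`), sweep results need a ranking step that reflects the workload. A minimal sketch, where the field names, configs, and numbers are illustrative assumptions rather than real measurements:

```python
# Hedged sketch: rank llama-bench-style sweep results by a weighted score.
# Field names and throughput numbers are illustrative, not real measurements;
# llama.cpp's actual split modes are none/layer/row via --split-mode.

def rank_configs(results, pp_weight=0.3, tg_weight=0.7):
    """Sort configs by weighted pp/tg throughput, best first.

    Weights reflect that interactive use is usually generation-bound,
    while batch summarization would weight pp much higher.
    """
    score = lambda r: pp_weight * r["pp_tok_s"] + tg_weight * r["tg_tok_s"]
    return sorted(results, key=score, reverse=True)

# Illustrative only: a split mode that wins on prompts can lose on decode.
sweep = [
    {"config": "-sm layer, experts on CPU", "pp_tok_s": 220.0, "tg_tok_s": 9.5},
    {"config": "-sm row, experts on CPU",   "pp_tok_s": 180.0, "tg_tok_s": 11.2},
]
print(rank_configs(sweep)[0]["config"])
```

This is the thread's practical conclusion in code form: the "best" setup is a function of the weights you assign, not a property of the hardware alone.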
// TAGS
llama-cpp · ik-llama-cpp · inference · gpu · self-hosted · llm · open-source
DISCOVERED
4h ago
2026-04-21
PUBLISHED
6h ago
2026-04-21
RELEVANCE
7/10
AUTHOR
pixelterpy