Expert swapping throttles MoE speed to 10-20%
A consensus among the r/LocalLLaMA community indicates that Mixture-of-Experts (MoE) models typically run at only 10% to 20% of their peak "full speed" when hardware constraints force only the active parameters into VRAM while the rest reside in system RAM. The bottleneck is primarily the overhead of fetching the experts selected for each token across the PCIe bus whenever they are not already resident in VRAM, a link with far less bandwidth than high-speed VRAM or unified memory architectures.
Running massive models like Qwen 397B on consumer GPUs is a triumph of capacity over latency, but the expert-swapping tax is brutal for real-time use. PCIe remains the ultimate bottleneck, offering a fraction of the throughput of high-end VRAM or HBM, so achieving full MoE potential requires the entire parameter set to reside in high-bandwidth memory.
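The bandwidth argument can be made concrete with a back-of-envelope bound. The sketch below is illustrative only: the link speeds, active-parameter count, and quantization width are assumptions, and it models the worst case where every token streams its active experts over PCIe (real setups cache hot experts in VRAM, which is why observed speeds land nearer the 10-20% figure than this pessimistic bound).

```python
# Back-of-envelope bound: decode speed when active expert weights
# must cross a given memory link for every generated token.
# All constants are illustrative assumptions, not measurements.

PCIE_GBPS = 25.0        # assumed effective PCIe 4.0 x16 throughput, GB/s
VRAM_GBPS = 900.0       # assumed high-end GDDR6X VRAM bandwidth, GB/s
ACTIVE_PARAMS = 22e9    # assumed active parameters per token for the MoE
BYTES_PER_PARAM = 0.5   # assumed 4-bit quantization

def max_tokens_per_sec(bandwidth_gbps: float) -> float:
    """Upper bound on tokens/sec if each token must stream all
    active expert weights over a link of the given bandwidth."""
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    return bandwidth_gbps * 1e9 / bytes_per_token

pcie_bound = max_tokens_per_sec(PCIE_GBPS)
vram_bound = max_tokens_per_sec(VRAM_GBPS)
print(f"PCIe-bound decode: {pcie_bound:.1f} tok/s")
print(f"VRAM-bound decode: {vram_bound:.1f} tok/s")
print(f"worst-case ratio:  {pcie_bound / vram_bound:.1%}")
```

Under these assumptions the PCIe-bound rate is a low single-digit tokens/sec, while the same weights served from VRAM would be orders of magnitude faster, which is the core of the capacity-versus-latency trade-off described above.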
DISCOVERED: 2026-04-10 (1d ago)
PUBLISHED: 2026-04-10 (1d ago)
AUTHOR: DeepOrangeSky