Dynamic expert cache speeds ik_llama.cpp inference
A developer has implemented a "Hot Expert Cache" for MoE (Mixture of Experts) models in ik_llama.cpp, significantly improving inference speed on consumer hardware with limited VRAM. By tracking which experts were most frequently routed over the past N tokens and dynamically loading those experts into VRAM, the implementation achieved a 26.8% speedup in token generation over standard layer-based offloading on an RTX 4090/Ryzen 9 setup. This optimization makes more efficient use of system resources, providing a smoother experience for massive models like Qwen3.5-122B-A10B without requiring a unified memory system.
This dynamic caching strategy bridges the gap between full GPU offloading and slow CPU-only inference for massive MoE models by prioritizing VRAM for the model components that are actually active. Beyond the 26.8% gain over traditional offloading, it delivers a 44.8% boost over the CPU-only baseline, raising generation speed for the 122B model to 22.7 tok/s on consumer hardware.
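The core idea — count which experts the router selected over a sliding window of recent tokens, and keep the hottest ones resident in VRAM — can be sketched roughly as follows. This is a minimal illustration, not the actual ik_llama.cpp code; the class name `HotExpertCache`, the `vram_slots`/`window` parameters, and the residency set are all assumptions for the sketch, and real VRAM upload/eviction is only simulated by set membership.

```python
from collections import Counter, deque

class HotExpertCache:
    """Illustrative sketch (not the actual implementation): track which
    experts were routed for the last `window` tokens and keep the top
    `vram_slots` most frequent ones "resident" in VRAM."""

    def __init__(self, vram_slots: int, window: int):
        self.vram_slots = vram_slots
        self.history = deque(maxlen=window)   # routing decisions, one tuple per token
        self.counts = Counter()               # expert id -> hits within the window
        self.resident = set()                 # expert ids currently "in VRAM"

    def observe(self, expert_ids):
        """Record the experts routed for one token, then refresh residency."""
        # If the window is full, the oldest token's routing falls out of
        # the deque on append, so subtract its contribution first.
        if len(self.history) == self.history.maxlen:
            for e in self.history[0]:
                self.counts[e] -= 1
                if self.counts[e] == 0:
                    del self.counts[e]
        self.history.append(tuple(expert_ids))
        self.counts.update(expert_ids)
        # Keep the hottest experts resident; everything else is evicted.
        # A real implementation would copy the promoted experts' weights
        # to VRAM here and free the evicted ones.
        self.resident = {e for e, _ in self.counts.most_common(self.vram_slots)}

    def is_hot(self, expert_id) -> bool:
        return expert_id in self.resident
```

With a window like this, residency adapts as the routing distribution drifts: experts that stop being selected age out of the counts and lose their VRAM slot to newly hot ones, which is what lets the cache stay small relative to the full expert set.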
DISCOVERED: 6h ago (2026-04-15)
PUBLISHED: 7h ago (2026-04-15)
AUTHOR: TriWrite