Dynamic expert cache speeds ik_llama.cpp inference
OPEN_SOURCE
REDDIT // 6h ago · OPEN_SOURCE RELEASE

A developer has implemented a "Hot Expert Cache" for MoE (Mixture of Experts) models in ik_llama.cpp, significantly improving inference speed on consumer hardware with limited VRAM. By tracking which experts are most frequently routed over the past N tokens and dynamically loading those into VRAM, the implementation achieved a 26.8% speedup in token generation over standard layer-based offloading on an RTX 4090/Ryzen 9 setup. This makes more efficient use of limited VRAM and delivers smoother inference for massive models like Qwen3.5-122B-A10B without requiring a unified memory system.

// ANALYSIS

This dynamic caching strategy bridges the gap between full GPU offloading and slow CPU-only inference for massive MoE models by prioritizing VRAM for the experts that are actually active. It delivers a 26.8% improvement over traditional layer-based offloading and a 44.8% boost over the CPU-only baseline, raising generation speed for 122B models to 22.7 tok/s on consumer hardware.

// TAGS
ik-llama-cpp · qwen-3-5 · moe · vram · optimization · llm · inference

DISCOVERED

6h ago

2026-04-15

PUBLISHED

7h ago

2026-04-15

RELEVANCE

8/10

AUTHOR

TriWrite