Dynamic expert cache speeds ik_llama.cpp inference
A developer has implemented a "Hot Expert Cache" for MoE (Mixture of Experts) models in ik_llama.cpp, significantly improving inference speed on consumer hardware with limited VRAM. By tracking which experts were most frequently routed over the past N tokens and dynamically loading those experts into VRAM, the implementation achieved a 26.8% speedup in token generation over standard layer-based offloading on an RTX 4090/Ryzen 9 setup. This optimization makes more efficient use of system resources, providing a smoother experience for massive models like Qwen3.5-122B-A10B without requiring a unified memory system.
This dynamic caching strategy bridges the gap between full GPU offloading and slow CPU-only inference for massive MoE models by prioritizing VRAM for the model components that are actually active. Beyond the 26.8% gain over traditional offloading, it delivers a 44.8% boost over the CPU-only baseline, raising generation speed for the 122B model to 22.7 tok/s on consumer hardware.
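The core idea — count which experts the router selected over a sliding window of recent tokens, and keep the hottest ones resident in VRAM — can be sketched roughly as follows. This is a minimal illustration, not the actual ik_llama.cpp code; the class name `HotExpertCache`, the `vram_slots`/`window` parameters, and the residency set are all assumptions for the sketch, and real VRAM upload/eviction is only simulated by set membership.

```python
from collections import Counter, deque

class HotExpertCache:
    """Illustrative sketch (not the actual implementation): track which
    experts were routed for the last `window` tokens and keep the top
    `vram_slots` most frequent ones "resident" in VRAM."""

    def __init__(self, vram_slots: int, window: int):
        self.vram_slots = vram_slots
        self.history = deque(maxlen=window)   # routing decisions, one tuple per token
        self.counts = Counter()               # expert id -> hits within the window
        self.resident = set()                 # expert ids currently "in VRAM"

    def observe(self, expert_ids):
        """Record the experts routed for one token, then refresh residency."""
        # If the window is full, the oldest token's routing falls out of
        # the deque on append, so subtract its contribution first.
        if len(self.history) == self.history.maxlen:
            for e in self.history[0]:
                self.counts[e] -= 1
                if self.counts[e] == 0:
                    del self.counts[e]
        self.history.append(tuple(expert_ids))
        self.counts.update(expert_ids)
        # Keep the hottest experts resident; everything else is evicted.
        # A real implementation would copy the promoted experts' weights
        # to VRAM here and free the evicted ones.
        self.resident = {e for e, _ in self.counts.most_common(self.vram_slots)}

    def is_hot(self, expert_id) -> bool:
        return expert_id in self.resident
```

With a window like this, residency adapts as the routing distribution drifts: experts that stop being selected age out of the counts and lose their VRAM slot to newly hot ones, which is what lets the cache stay small relative to the full expert set.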
DISCOVERED: 6h ago (2026-04-15)
PUBLISHED: 7h ago (2026-04-15)
AUTHOR: TriWrite