OPEN_SOURCE ↗
REDDIT // 25d ago · INFRASTRUCTURE
vLLM PR adds dynamic MoE expert caching
A new open pull request to vLLM proposes dynamic MoE expert caching that keeps a configurable hot set of experts in GPU VRAM and offloads the rest to CPU pinned memory, with CPU fallback on cache misses. The author reports fitting a 16 GB-class MoE workload onto an 8 GB GPU using --moe-expert-cache-size, framing it as a practical path for memory-constrained local inference.
// ANALYSIS
This is the kind of gritty infra patch that could matter more than flashy model launches if it lands cleanly upstream.
- The PR is currently open (not merged), so real-world benefit depends on review outcome, kernel compatibility, and production hardening.
- It directly targets a known vLLM pain point around dynamic MoE offloading, where community discussions previously said this was not supported.
- The design aligns with the core MoE reality: expert usage is skewed, so an LRU hot-expert cache can cut VRAM pressure with acceptable latency tradeoffs.
- CPU fallback on miss is a pragmatic choice for single-GPU users, but prefill-heavy or highly diverse routing workloads may still see latency spikes.
- If follow-up work (MXFP4, disk streaming, two-tier cache, EP/DP integration) ships, this could become a strong unlock for local and budget deployments.
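The caching idea behind the PR can be sketched with a plain LRU structure: keep a bounded hot set of experts in fast memory, evict the least-recently-used on pressure, and fall back to slow (CPU) storage on a miss. This is a minimal illustrative sketch, not the PR's actual implementation; the class, its methods, and the string "weights" stand-ins are all hypothetical.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache holding a hot set of MoE expert weights in fast memory.

    Hypothetical sketch of the PR's caching idea: experts evicted from
    the hot set remain in slow (CPU) storage and are re-promoted on
    access. Names and structure are illustrative, not vLLM's API.
    """

    def __init__(self, capacity, all_experts):
        self.capacity = capacity       # max experts resident in "VRAM"
        self.slow = dict(all_experts)  # full weight set in "CPU" memory
        self.hot = OrderedDict()       # LRU-ordered resident experts
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.hot:
            self.hits += 1
            self.hot.move_to_end(expert_id)  # mark most-recently-used
            return self.hot[expert_id]
        # Cache miss: "transfer" from slow memory, evicting the LRU expert.
        self.misses += 1
        if len(self.hot) >= self.capacity:
            self.hot.popitem(last=False)     # evict least-recently-used
        weights = self.slow[expert_id]       # CPU fallback path
        self.hot[expert_id] = weights
        return weights

# Skewed routing: a few experts dominate, so a small hot set hits often.
experts = {i: f"weights-{i}" for i in range(64)}
cache = ExpertCache(capacity=8, all_experts=experts)
for eid in [0, 1, 2, 0, 1, 3, 0, 2, 1, 0]:
    cache.get(eid)
print(cache.hits, cache.misses)  # prints "6 4"
```

The hit/miss split shows why the skew argument matters: once the hot set warms up with the dominant experts, most routing decisions never touch slow memory, and only the long tail pays the transfer cost.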
// TAGS
vllm · llm · inference · gpu · open-source · self-hosted
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
RELEVANCE
8/10
AUTHOR
king_of_jupyter