vLLM PR adds dynamic MoE expert caching
REDDIT · 25d ago · INFRASTRUCTURE


A new open pull request to vLLM proposes dynamic MoE expert caching that keeps a configurable hot set of experts in GPU VRAM and offloads the rest to CPU pinned memory, with CPU fallback on cache misses. The author reports fitting a 16 GB-class MoE workload onto an 8 GB GPU using --moe-expert-cache-size, framing it as a practical path for memory-constrained local inference.
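If the PR lands as described, usage on a memory-constrained host would be a single extra flag on an otherwise normal launch. The sketch below is an assumption: `vllm serve` is vLLM's standard entry point, but the model name and the flag's value are placeholders, and the value's exact semantics (expert count vs. bytes) are defined by the PR, not here.

```shell
# Hypothetical launch on an 8 GB GPU; only --moe-expert-cache-size comes
# from the PR described above. Placeholders in <> are not real values.
vllm serve <moe-model> --moe-expert-cache-size <hot-expert-budget>
```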

// ANALYSIS

This is the kind of gritty infra patch that could matter more than flashy model launches if it lands cleanly upstream.

  • The PR is currently open (not merged), so real-world benefit depends on review outcome, kernel compatibility, and production hardening.
  • It directly targets a known vLLM pain point around dynamic MoE offloading, where community discussions previously said this was not supported.
  • The design aligns with the core MoE reality: expert usage is skewed, so an LRU hot-expert cache can cut VRAM pressure with acceptable latency tradeoffs.
  • CPU fallback on miss is a pragmatic choice for single-GPU users, but prefill-heavy or highly diverse routing workloads may still see latency spikes.
  • If follow-up work (MXFP4, disk streaming, two-tier cache, EP/DP integration) ships, this could become a strong unlock for local and budget deployments.
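The skewed-routing point above can be made concrete with a toy LRU cache. Everything here is a hypothetical illustration, not the PR's code: `ExpertCache`, its fields, and the placeholder weight strings stand in for real per-expert tensors and host-to-device copies.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU hot-expert cache: a sketch of the idea, not the PR's code.

    A bounded set of expert weights is resident in the fast tier ("GPU VRAM");
    the rest live in the slow tier ("CPU pinned memory"). A miss is served
    from the slow tier (CPU fallback), then the expert is promoted so that
    later tokens routed to it hit the fast tier."""

    def __init__(self, capacity, all_experts):
        self.capacity = capacity       # hot-set budget, akin to --moe-expert-cache-size
        self.slow = dict(all_experts)  # expert_id -> weights, always available
        self.fast = OrderedDict()      # expert_id -> weights, in LRU order
        self.hits = self.misses = 0

    def lookup(self, expert_id):
        if expert_id in self.fast:
            self.hits += 1
            self.fast.move_to_end(expert_id)   # mark most recently used
            return self.fast[expert_id], "gpu"
        self.misses += 1
        weights = self.slow[expert_id]         # CPU fallback serves this call
        if len(self.fast) >= self.capacity:
            self.fast.popitem(last=False)      # evict least recently used expert
        self.fast[expert_id] = weights         # promote (a real H2D copy in practice)
        return weights, "cpu"

# Skewed routing: expert 0 is hot, so it stays resident after its first miss.
experts = {i: f"weights-{i}" for i in range(8)}
cache = ExpertCache(capacity=2, all_experts=experts)
for eid in [0, 1, 0, 2, 0, 1]:
    cache.lookup(eid)
print(cache.hits, cache.misses)  # 2 hits, 4 misses; experts 0 and 1 resident
```

Promoting on every miss is one possible policy; a real implementation might instead serve cold experts on CPU indefinitely and promote only on repeated use, trading extra CPU compute for fewer PCIe transfers.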
// TAGS
vllm · llm · inference · gpu · open-source · self-hosted

DISCOVERED

2026-03-17 (25d ago)

PUBLISHED

2026-03-17 (25d ago)

RELEVANCE

8/10

AUTHOR

king_of_jupyter