OPEN_SOURCE ↗
REDDIT // 25d ago · INFRASTRUCTURE
vLLM PR adds dynamic MoE expert caching
A new open pull request to vLLM proposes dynamic MoE expert caching that keeps a configurable hot set of experts in GPU VRAM and offloads the rest to CPU pinned memory, with CPU fallback on cache misses. The author reports fitting a 16 GB-class MoE workload onto an 8 GB GPU using --moe-expert-cache-size, framing it as a practical path for memory-constrained local inference.
// ANALYSIS
This is the kind of gritty infra patch that could matter more than flashy model launches if it lands cleanly upstream.
- The PR is currently open (not merged), so real-world benefit depends on review outcome, kernel compatibility, and production hardening.
- It directly targets a known vLLM pain point around dynamic MoE offloading, where community discussions previously said this was not supported.
- The design aligns with the core MoE reality: expert usage is skewed, so an LRU hot-expert cache can cut VRAM pressure with acceptable latency tradeoffs.
- CPU fallback on miss is a pragmatic choice for single-GPU users, but prefill-heavy or highly diverse routing workloads may still see latency spikes.
- If follow-up work (MXFP4, disk streaming, two-tier cache, EP/DP integration) ships, this could become a strong unlock for local and budget deployments.
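The caching idea behind the PR can be sketched with a plain LRU structure: keep a bounded hot set of experts in fast memory, evict the least-recently-used on pressure, and fall back to slow (CPU) storage on a miss. This is a minimal illustrative sketch, not the PR's actual implementation; the class, its methods, and the string "weights" stand-ins are all hypothetical.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache holding a hot set of MoE expert weights in fast memory.

    Hypothetical sketch of the PR's caching idea: experts evicted from
    the hot set remain in slow (CPU) storage and are re-promoted on
    access. Names and structure are illustrative, not vLLM's API.
    """

    def __init__(self, capacity, all_experts):
        self.capacity = capacity       # max experts resident in "VRAM"
        self.slow = dict(all_experts)  # full weight set in "CPU" memory
        self.hot = OrderedDict()       # LRU-ordered resident experts
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.hot:
            self.hits += 1
            self.hot.move_to_end(expert_id)  # mark most-recently-used
            return self.hot[expert_id]
        # Cache miss: "transfer" from slow memory, evicting the LRU expert.
        self.misses += 1
        if len(self.hot) >= self.capacity:
            self.hot.popitem(last=False)     # evict least-recently-used
        weights = self.slow[expert_id]       # CPU fallback path
        self.hot[expert_id] = weights
        return weights

# Skewed routing: a few experts dominate, so a small hot set hits often.
experts = {i: f"weights-{i}" for i in range(64)}
cache = ExpertCache(capacity=8, all_experts=experts)
for eid in [0, 1, 2, 0, 1, 3, 0, 2, 1, 0]:
    cache.get(eid)
print(cache.hits, cache.misses)  # prints "6 4"
```

The hit/miss split shows why the skew argument matters: once the hot set warms up with the dominant experts, most routing decisions never touch slow memory, and only the long tail pays the transfer cost.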
// TAGS
vllm · llm · inference · gpu · open-source · self-hosted
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
RELEVANCE
8/10
AUTHOR
king_of_jupyter