MoE serving thread asks for hot-cold expert caching
A LocalLLaMA Reddit thread asks whether inference stacks like vLLM and SGLang can keep frequently used MoE experts in VRAM while offloading colder experts to RAM or disk. It is a sharp infrastructure question, because MoE routing is often highly skewed in practice, but current serving stacks still focus more on parallelism and throughput than usage-aware expert placement.
This is a real systems problem, not forum-bike-shedding: once MoE models hit constrained hardware, expert placement becomes a first-class serving knob. The notable signal is that SGLang's KTransformers roadmap already calls out “hotness aware expert distribution,” which makes this look more like an incoming optimization path than a niche idea.
- vLLM publicly emphasizes high-throughput, memory-efficient serving and expert-parallel deployment, but its docs do not frame expert scheduling as hot/cold expert caching
- SGLang has an open hybrid CPU/GPU MoE effort that explicitly lists hotness-aware expert distribution on its roadmap
- For local and cost-sensitive deployments, keeping hot experts resident in VRAM could matter more than another incremental benchmark gain
- The hard part is workload drift: expert popularity changes over time, so bad scheduling could add transfer stalls and cancel out the memory savings
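To make the idea concrete, here is a minimal sketch of hotness-aware expert placement with decay, so that popularity drift is eventually forgotten. This is a hypothetical illustration, not the vLLM or SGLang API: `ExpertPlacementCache`, its parameters, and the VRAM/RAM distinction (modeled as a "resident set") are all assumptions made for the example.

```python
class ExpertPlacementCache:
    """Hypothetical hotness-aware expert cache (illustration only).

    Tracks per-expert routing counts with exponential decay so that stale
    popularity fades over time, and keeps the k hottest experts "resident"
    (standing in for VRAM) while the rest are treated as offloaded.
    """

    def __init__(self, num_experts: int, vram_slots: int, decay: float = 0.99):
        self.hotness = [0.0] * num_experts
        self.vram_slots = vram_slots
        self.decay = decay

    def record_routing(self, expert_ids):
        # Decay old counts first so workload drift is handled: experts that
        # stop being routed to lose hotness and eventually get evicted.
        self.hotness = [h * self.decay for h in self.hotness]
        for e in expert_ids:
            self.hotness[e] += 1.0

    def resident_set(self):
        # Keep the k hottest experts in VRAM; the rest stay in RAM/disk.
        ranked = sorted(range(len(self.hotness)), key=lambda e: -self.hotness[e])
        return set(ranked[: self.vram_slots])


# Skewed routing: experts 0-3 dominate early traffic.
cache = ExpertPlacementCache(num_experts=16, vram_slots=4)
for _ in range(200):
    cache.record_routing([0, 1, 2, 3])
print(cache.resident_set())  # {0, 1, 2, 3}

# Workload drift: traffic shifts to experts 12-15. With decay, the old hot
# set fades and the new one takes over the resident slots.
for _ in range(2000):
    cache.record_routing([12, 13, 14, 15])
print(cache.resident_set())  # {12, 13, 14, 15}
```

The decay factor is the knob the bullet above warns about: too little decay and the cache clings to a stale hot set, too much and transient bursts trigger churn and transfer stalls that eat the memory savings.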
DISCOVERED 2026-03-11
PUBLISHED 2026-03-10
AUTHOR sayamss