MoE serving thread asks for hot-cold expert caching
OPEN_SOURCE ↗
REDDIT // 31d ago // INFRASTRUCTURE

A LocalLLaMA Reddit thread asks whether inference stacks like vLLM and SGLang can keep frequently used MoE experts in VRAM while offloading colder experts to RAM or disk. The question is sharp: MoE routing is often highly skewed in practice, yet current serving stacks focus on parallelism and throughput rather than usage-aware expert placement.

// ANALYSIS

This is a real systems problem, not forum bike-shedding: once MoE models hit constrained hardware, expert placement becomes a first-class serving knob. The notable signal is that SGLang's KTransformers roadmap already calls out “hotness aware expert distribution,” which makes this look more like an incoming optimization path than a niche idea.

  • vLLM publicly emphasizes high-throughput, memory-efficient serving and expert-parallel deployment, but its docs do not frame expert scheduling as hot/cold expert caching
  • SGLang has an open hybrid CPU/GPU MoE effort that explicitly lists hotness-aware expert distribution on its roadmap
  • For local and cost-sensitive deployments, keeping hot experts resident in VRAM could matter more than another incremental benchmark gain
  • The hard part is workload drift: expert popularity changes over time, so bad scheduling could add transfer stalls and cancel out the memory savings
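The mechanism the thread is asking for can be sketched in a few lines. This is a hypothetical toy model, not vLLM or SGLang code: hotness is an exponentially decayed hit count per expert, the top-k hottest experts are treated as VRAM-resident, and a routed request to a non-resident expert counts as a simulated transfer stall. The decay is what addresses the workload-drift concern above, since stale popularity fades over time.

```python
from collections import defaultdict

class ExpertCache:
    """Toy hotness-aware expert placement (illustrative only).

    The hottest `vram_slots` experts are considered VRAM-resident;
    the rest are treated as offloaded to RAM/disk. Hotness is an
    exponentially decayed hit count, so placement adapts as routing
    popularity drifts."""

    def __init__(self, vram_slots, decay=0.95):
        self.vram_slots = vram_slots
        self.decay = decay
        self.hotness = defaultdict(float)

    def vram_set(self):
        # The current top-k experts by decayed hit count.
        ranked = sorted(self.hotness, key=self.hotness.get, reverse=True)
        return set(ranked[:self.vram_slots])

    def route(self, expert_ids):
        # Called once per batch with the experts the router selected.
        # Decay all counters, then credit the experts just used.
        for e in self.hotness:
            self.hotness[e] *= self.decay
        stalls = 0
        for e in expert_ids:
            if e not in self.vram_set():
                stalls += 1  # would trigger a host-to-device transfer
            self.hotness[e] += 1.0
        return stalls

cache = ExpertCache(vram_slots=2)
for _ in range(10):
    cache.route([0, 1])       # experts 0 and 1 dominate routing
print(cache.vram_set())       # the two hot experts are resident
print(cache.route([2]))       # a cold expert costs one simulated stall
```

A real implementation would move this decision off the critical path (e.g. asynchronous prefetch and eviction between batches), since recomputing placement per token would add exactly the stalls the bullet above warns about.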
// TAGS
vllm · inference · llm · gpu · devtool

DISCOVERED

31d ago

2026-03-11

PUBLISHED

33d ago

2026-03-10

RELEVANCE

6 / 10

AUTHOR

sayamss