MoE serving thread asks for hot-cold expert caching
A LocalLLaMA Reddit thread asks whether inference stacks like vLLM and SGLang can keep frequently used MoE experts in VRAM while offloading colder experts to RAM or disk. It is a sharp infrastructure question, because MoE routing is often highly skewed in practice, but current serving stacks still focus more on parallelism and throughput than usage-aware expert placement.
This is a real systems problem, not forum-bike-shedding: once MoE models hit constrained hardware, expert placement becomes a first-class serving knob. The notable signal is that SGLang's KTransformers roadmap already calls out “hotness aware expert distribution,” which makes this look more like an incoming optimization path than a niche idea.
- vLLM publicly emphasizes high-throughput, memory-efficient serving and expert-parallel deployment, but its docs do not frame expert scheduling as hot/cold expert caching
- SGLang has an open hybrid CPU/GPU MoE effort that explicitly lists hotness-aware expert distribution on its roadmap
- For local and cost-sensitive deployments, keeping hot experts resident in VRAM could matter more than another incremental benchmark gain
- The hard part is workload drift: expert popularity changes over time, so bad scheduling could add transfer stalls and cancel out the memory savings
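To make the idea concrete, here is a minimal sketch of hotness-aware expert placement with decay, so that popularity drift is eventually forgotten. This is a hypothetical illustration, not the vLLM or SGLang API: `ExpertPlacementCache`, its parameters, and the VRAM/RAM distinction (modeled as a "resident set") are all assumptions made for the example.

```python
class ExpertPlacementCache:
    """Hypothetical hotness-aware expert cache (illustration only).

    Tracks per-expert routing counts with exponential decay so that stale
    popularity fades over time, and keeps the k hottest experts "resident"
    (standing in for VRAM) while the rest are treated as offloaded.
    """

    def __init__(self, num_experts: int, vram_slots: int, decay: float = 0.99):
        self.hotness = [0.0] * num_experts
        self.vram_slots = vram_slots
        self.decay = decay

    def record_routing(self, expert_ids):
        # Decay old counts first so workload drift is handled: experts that
        # stop being routed to lose hotness and eventually get evicted.
        self.hotness = [h * self.decay for h in self.hotness]
        for e in expert_ids:
            self.hotness[e] += 1.0

    def resident_set(self):
        # Keep the k hottest experts in VRAM; the rest stay in RAM/disk.
        ranked = sorted(range(len(self.hotness)), key=lambda e: -self.hotness[e])
        return set(ranked[: self.vram_slots])


# Skewed routing: experts 0-3 dominate early traffic.
cache = ExpertPlacementCache(num_experts=16, vram_slots=4)
for _ in range(200):
    cache.record_routing([0, 1, 2, 3])
print(cache.resident_set())  # {0, 1, 2, 3}

# Workload drift: traffic shifts to experts 12-15. With decay, the old hot
# set fades and the new one takes over the resident slots.
for _ in range(2000):
    cache.record_routing([12, 13, 14, 15])
print(cache.resident_set())  # {12, 13, 14, 15}
```

The decay factor is the knob the bullet above warns about: too little decay and the cache clings to a stale hot set, too much and transient bursts trigger churn and transfer stalls that eat the memory savings.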
DISCOVERED 2026-03-11
PUBLISHED 2026-03-10
AUTHOR sayamss