Expert swapping throttles MoE speed to 10-20%
A consensus among the r/LocalLLaMA community indicates that Mixture-of-Experts (MoE) models typically run at only 10% to 20% of their peak "full speed" when hardware constraints force only the active parameters into VRAM while the rest reside in system RAM. The bottleneck is primarily the overhead of fetching the experts selected for each token across the PCIe bus whenever they are not already resident in VRAM, a link with far less bandwidth than high-speed VRAM or unified memory architectures.
Running massive models like Qwen 397B on consumer GPUs is a triumph of capacity over latency, but the expert-swapping tax is brutal for real-time use. PCIe remains the ultimate bottleneck, offering a fraction of the throughput of high-end VRAM or HBM, so achieving full MoE potential requires the entire parameter set to reside in high-bandwidth memory.
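The bandwidth argument can be made concrete with a back-of-envelope bound. The sketch below is illustrative only: the link speeds, active-parameter count, and quantization width are assumptions, and it models the worst case where every token streams its active experts over PCIe (real setups cache hot experts in VRAM, which is why observed speeds land nearer the 10-20% figure than this pessimistic bound).

```python
# Back-of-envelope bound: decode speed when active expert weights
# must cross a given memory link for every generated token.
# All constants are illustrative assumptions, not measurements.

PCIE_GBPS = 25.0        # assumed effective PCIe 4.0 x16 throughput, GB/s
VRAM_GBPS = 900.0       # assumed high-end GDDR6X VRAM bandwidth, GB/s
ACTIVE_PARAMS = 22e9    # assumed active parameters per token for the MoE
BYTES_PER_PARAM = 0.5   # assumed 4-bit quantization

def max_tokens_per_sec(bandwidth_gbps: float) -> float:
    """Upper bound on tokens/sec if each token must stream all
    active expert weights over a link of the given bandwidth."""
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    return bandwidth_gbps * 1e9 / bytes_per_token

pcie_bound = max_tokens_per_sec(PCIE_GBPS)
vram_bound = max_tokens_per_sec(VRAM_GBPS)
print(f"PCIe-bound decode: {pcie_bound:.1f} tok/s")
print(f"VRAM-bound decode: {vram_bound:.1f} tok/s")
print(f"worst-case ratio:  {pcie_bound / vram_bound:.1%}")
```

Under these assumptions the PCIe-bound rate is a low single-digit tokens/sec, while the same weights served from VRAM would be orders of magnitude faster, which is the core of the capacity-versus-latency trade-off described above.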
DISCOVERED: 2026-04-10 (1d ago)
PUBLISHED: 2026-04-10 (1d ago)
AUTHOR: DeepOrangeSky