Flash-MoE runs 400B models on 48GB MacBook
Flash-MoE is a pure C/Metal inference engine for Apple Silicon that streams 400B parameter models from SSD to GPU on demand. It enables interactive-speed inference of massive Mixture-of-Experts models on consumer hardware by bypassing traditional RAM limitations.
Flash-MoE is a masterclass in hardware optimization, proving that smarter I/O, not just more VRAM, is the key to unlocking massive local models. Streaming expert weights from SSD lets the 397B-parameter Qwen 3.5 model run on a 48GB MacBook Pro at ~4.4 tokens/s via parallel pread() calls. The pure C/Metal implementation bypasses Python overhead entirely, with hand-tuned shaders whose FMA optimization yields a 12% performance boost. It relies on the macOS page cache to manage weight streaming, demonstrating that modern NVMe throughput is finally sufficient for real-time inference. The 4-bit quantized mode preserves full tool-calling capability, while an experimental 2-bit mode pushes speeds toward 7 tokens/s. Together this lowers the barrier to state-of-the-art open models, putting 400B-class intelligence within reach of developers without server-grade clusters.
DISCOVERED: 2026-03-22
PUBLISHED: 2026-03-22
AUTHOR: Github Awesome