Flash-MoE runs 400B models on 48GB MacBook
OPEN_SOURCE ↗
YT · YOUTUBE // 21d ago · OPEN_SOURCE RELEASE


Flash-MoE is a pure C/Metal inference engine for Apple Silicon that streams 400B parameter models from SSD to GPU on demand. It enables interactive-speed inference of massive Mixture-of-Experts models on consumer hardware by bypassing traditional RAM limitations.

// ANALYSIS

Flash-MoE is a masterclass in hardware optimization, proving that smarter I/O—not just more VRAM—is the key to unlocking massive local models. SSD expert streaming allows the Qwen 3.5 397B model to run on a 48GB MacBook Pro, achieving ~4.4 tokens/s via parallel pread() calls. The pure C/Metal implementation completely bypasses Python overhead, using hand-tuned shaders and FMA optimization for a 12% performance boost. It leverages the macOS page cache to manage weight streaming, demonstrating that modern NVMe speeds are finally sufficient for real-time inference. Support for 4-bit quantization maintains full tool-calling capabilities, while 2-bit modes push experimental speeds toward 7 tokens/s. This lowers the barrier for state-of-the-art open models, making 400B-class intelligence accessible to developers without server-grade clusters.

// TAGS
flash-moe · inference · edge-ai · llm · open-source · gpu · apple-silicon

DISCOVERED

21d ago

2026-03-22

PUBLISHED

21d ago

2026-03-22

RELEVANCE

9/10

AUTHOR

Github Awesome