Flash-MoE runs 400B models on 48GB MacBook
Flash-MoE is a pure C/Metal inference engine for Apple Silicon that streams 400B parameter models from SSD to GPU on demand. It enables interactive-speed inference of massive Mixture-of-Experts models on consumer hardware by bypassing traditional RAM limitations.
Flash-MoE is a masterclass in hardware optimization, proving that smarter I/O, not just more VRAM, is the key to unlocking massive local models. Streaming expert weights from SSD lets the 397B-parameter Qwen 3.5 model run on a 48GB MacBook Pro at ~4.4 tokens/s via parallel pread() calls. The pure C/Metal implementation bypasses Python overhead entirely, with hand-tuned shaders whose FMA optimization yields a 12% performance boost. It relies on the macOS page cache to manage weight streaming, demonstrating that modern NVMe throughput is finally sufficient for real-time inference. The 4-bit quantized mode preserves full tool-calling capability, while an experimental 2-bit mode pushes speeds toward 7 tokens/s. Together this lowers the barrier to state-of-the-art open models, putting 400B-class intelligence within reach of developers without server-grade clusters.
DISCOVERED: 2026-03-22
PUBLISHED: 2026-03-22
AUTHOR: Github Awesome