Flash-MoE runs 400B models on 48GB MacBook

// 68d ago · OPENSOURCE RELEASE

Flash-MoE is a pure C/Metal inference engine for Apple Silicon that streams 400B-parameter models from SSD to GPU on demand. It enables interactive-speed inference of massive Mixture-of-Experts models on consumer hardware without requiring the full model to fit in RAM.

// ANALYSIS

Flash-MoE is a strong piece of hardware-aware engineering, showing that smarter I/O, not just more memory, is what unlocks massive local models. By streaming experts from SSD with parallel pread() calls, it runs the Qwen 3.5 397B model on a 48GB MacBook Pro at ~4.4 tokens/s. The pure C/Metal implementation avoids Python overhead entirely and uses hand-tuned shaders and FMA optimization for a 12% performance boost. Weight streaming is managed through the macOS page cache, which suggests that modern NVMe speeds are now sufficient for real-time inference. The 4-bit quantized mode retains full tool-calling capability, while experimental 2-bit modes push speeds toward 7 tokens/s. This lowers the barrier to running state-of-the-art open models, putting 400B-class models within reach of developers without server-grade clusters.
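
To make the parallel pread() idea concrete, here is a minimal sketch in Python (not the project's C/Metal code). It assumes a hypothetical weight file in which every expert occupies a fixed-size slot at offset expert_id * EXPERT_BYTES; the names EXPERT_BYTES, read_expert, load_experts, and make_demo_file are invented for illustration, and the real on-disk layout of Flash-MoE may differ.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical size of one expert's packed weights (real experts are far larger).
EXPERT_BYTES = 4096


def read_expert(fd: int, expert_id: int) -> bytes:
    """Read one expert's weights from its fixed slot with a positional read."""
    # os.pread takes an explicit offset and does not move the shared file
    # position, so several threads can use the same descriptor concurrently.
    data = os.pread(fd, EXPERT_BYTES, expert_id * EXPERT_BYTES)
    if len(data) != EXPERT_BYTES:
        raise OSError(f"short read for expert {expert_id}")
    return data


def load_experts(path: str, expert_ids: list[int], workers: int = 4) -> dict[int, bytes]:
    """Fetch only the experts selected for the current token, in parallel."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map preserves input order, so blobs line up with expert_ids.
            blobs = list(pool.map(lambda e: read_expert(fd, e), expert_ids))
    finally:
        os.close(fd)
    return dict(zip(expert_ids, blobs))


def make_demo_file(num_experts: int) -> str:
    """Write a toy weight file where expert i is filled with byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_experts):
            f.write(bytes([i]) * EXPERT_BYTES)
    return path
```

Calling load_experts(path, [1, 5, 6]) on a file produced by make_demo_file(8) reads three slots and leaves the other five untouched on disk. Repeated reads of the same slot can be served from the operating system's page cache, which is the behavior the analysis above credits for keeping streaming fast.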

// TAGS
flash-moe, inference, edge-ai, llm, open-source, gpu, apple-silicon

DISCOVERED

68d ago

2026-03-22

PUBLISHED

68d ago

2026-03-22

RELEVANCE

9/10

AUTHOR

Github Awesome