Anemll Flash-MLX speeds MoE on Apple Silicon
Anemll open-sourced anemll-flash-mlx, a focused MLX toolkit for running large Mixture-of-Experts models on Apple Silicon. It keeps the dense path in MLX and handles MoE experts with a stable slot-bank, hit/miss separation, SSD-backed streaming, and support for mlx-community checkpoints plus mixed or dynamic quant sidecars.
This is the right kind of narrow infra bet: it attacks the MoE bottleneck instead of trying to rebuild MLX around sparse experts, which makes it feel like serious runtime plumbing for Apple Silicon locality rather than a polished app for casual users. The repo's own M5 Max 128 GB benchmarks are promising: resident-pread hits 101.6 tok/s on Qwen3.5-35B-A3B-4bit, and the best SSD-backed mode still clears 47 tok/s.

The slot-bank split is the key idea: it keeps the execution shape stable and reloads only the missed experts, rather than rebuilding per-token expert sets. Support for mlx-community checkpoints and mixed or dynamic quant sidecars matters because the real friction in these experiments is getting the model into a shape the runtime can actually use. Adjacent MLX projects like ZMLX are optimizing MoE decode from the kernel side, while Anemll is tackling the expert-storage and streaming half of the problem; the teased llama.cpp fork hints that this could grow into a broader sparse-inference stack rather than staying a one-off experiment.
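To make the slot-bank idea concrete, here is a minimal sketch of a fixed-capacity expert cache with hit/miss separation. All names (`SlotBank`, `fetch`, `load_expert`) are illustrative, not the anemll-flash-mlx API: the bank holds a fixed number of slots so tensor shapes stay stable across tokens, and per token the routed experts are split into hits (already resident) and misses (streamed in from storage into evicted slots).

```python
class SlotBank:
    """Hypothetical slot-bank expert cache (not the real anemll-flash-mlx code).

    A fixed slot count keeps execution shape stable; only missed experts
    trigger a storage read (e.g. an SSD pread in the real runtime).
    """

    def __init__(self, num_slots, load_expert):
        self.num_slots = num_slots          # fixed capacity -> stable shapes
        self.load_expert = load_expert      # callback: expert_id -> weights
        self.slot_of = {}                   # expert_id -> slot index
        self.slots = [None] * num_slots     # slot index -> resident expert_id
        self.lru = []                       # slot indices, least recent first

    def fetch(self, expert_ids):
        """Return {expert_id: slot index}, loading only the missed experts."""
        hits = [e for e in expert_ids if e in self.slot_of]
        misses = [e for e in expert_ids if e not in self.slot_of]
        for e in hits:                      # refresh recency, no storage touch
            s = self.slot_of[e]
            self.lru.remove(s)
            self.lru.append(s)
        for e in misses:                    # only misses hit storage
            if len(self.slot_of) < self.num_slots:
                s = self.slots.index(None)  # fill an empty slot first
            else:
                s = self.lru.pop(0)         # evict the least recent slot
                del self.slot_of[self.slots[s]]
            self.load_expert(e)             # stream weights into the slot
            self.slots[s] = e
            self.slot_of[e] = s
            self.lru.append(s)
        return {e: self.slot_of[e] for e in expert_ids}
```

The point of keeping `num_slots` fixed is that the compute graph addresses a constant-shaped bank of expert weights, so routing changes only which slots are read, never the shapes the kernels see.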
DISCOVERED: 2026-03-30
PUBLISHED: 2026-03-30
AUTHOR: Competitive-Bake4602