Anemll Flash-MLX speeds MoE on Apple Silicon
REDDIT · 12d ago · OPEN-SOURCE RELEASE


Anemll open-sourced anemll-flash-mlx, a focused MLX toolkit for running large Mixture-of-Experts models on Apple Silicon. It keeps the dense path in MLX and handles MoE experts with a stable slot-bank, hit/miss separation, SSD-backed streaming, and support for mlx-community checkpoints plus mixed or dynamic quant sidecars.

// ANALYSIS

This is the right kind of narrow infra bet: it attacks the MoE bottleneck instead of trying to rebuild MLX around sparse experts, which makes it feel like serious runtime plumbing for Apple Silicon locality rather than a polished app for casual users.

The repo's own M5 Max 128 GB benchmarks are promising: resident-pread hits 101.6 tok/s on Qwen3.5-35B-A3B-4bit, and the best SSD-backed mode still clears 47 tok/s. The slot-bank split is the key idea because it keeps execution shape stable while only reloading missed experts instead of rebuilding per-token expert sets. Support for mlx-community checkpoints and mixed or dynamic quant sidecars matters because the real friction in these experiments is getting the model into a shape the runtime can actually use.

Adjacent MLX projects like ZMLX are optimizing MoE decode from the kernel side, while Anemll is tackling the expert-storage and streaming half of the problem; the teased llama.cpp fork hints that this could grow into a broader sparse-inference stack rather than staying a one-off experiment.
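The slot-bank idea described above can be sketched roughly as follows. This is a hypothetical illustration, not Anemll's actual code: it assumes a fixed pool of resident slots, a `load_expert` callback standing in for the SSD-backed read, and LRU eviction. Hits reuse their slot with no I/O; only misses trigger a load, so the per-token execution shape (a fixed-size list of slot indices) stays stable.

```python
# Hypothetical slot-bank sketch for MoE expert caching (not Anemll's
# implementation): fixed resident slots, hit/miss separation per routing
# step, misses loaded into free or LRU-evicted slots.
from collections import OrderedDict

class SlotBank:
    def __init__(self, num_slots, load_expert):
        self.load_expert = load_expert      # expert_id -> weights (stands in for SSD pread)
        self.slot_of = OrderedDict()        # expert_id -> slot index, kept in LRU order
        self.weights = [None] * num_slots   # slot index -> loaded expert weights
        self.free = list(range(num_slots))  # slots not yet holding an expert

    def fetch(self, expert_ids):
        """Return slot indices for the routed experts; only misses do I/O."""
        hits = [e for e in expert_ids if e in self.slot_of]
        misses = [e for e in expert_ids if e not in self.slot_of]
        for e in hits:
            self.slot_of.move_to_end(e)     # refresh LRU position, no reload
        for e in misses:
            if self.free:
                slot = self.free.pop()
            else:
                # Evict the least-recently-used expert not routed this step.
                victim = next(v for v in self.slot_of if v not in expert_ids)
                slot = self.slot_of.pop(victim)
            self.weights[slot] = self.load_expert(e)   # the only load per miss
            self.slot_of[e] = slot
        # Execution shape stays stable: always len(expert_ids) slot indices.
        return [self.slot_of[e] for e in expert_ids]
```

The point of the split is visible in `fetch`: the hit path touches no storage at all, while each miss costs exactly one `load_expert` call, so sustained decode speed depends on the hit rate rather than on rebuilding the routed expert set every token.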

// TAGS
anemll-flash-mlx · llm · inference · open-source · edge-ai · devtool

DISCOVERED

2026-03-30

PUBLISHED

2026-03-30

RELEVANCE

8/10

AUTHOR

Competitive-Bake4602