Flash-MoE hits 13 tps with Qwen 3.5-397B
A benchmark on a 128GB M5 Max MacBook Pro shows Flash-MoE running Qwen 3.5-397B locally at nearly 13 tokens per second. Using the Anemll fork with Metal 4 optimizations and --cache-io-split 4, the tester reported a 3x gain over prior M3 Max results, and found that 4-bit quantization outperformed the 2-bit variants.
Local inference for "God-sized" models now reaches interactive speeds on consumer hardware, which suggests that unified memory architectures are changing the economics of LLM deployment.
* The performance peak at --cache-io-split 4 suggests that software-side I/O parallelism has to be tuned to the internal parallelism of the SSD controller to reach maximum throughput.
* 2-bit quantization looks like a "false economy" on M5 Max: the tester saw 4-bit outperform 2-bit, which suggests the SSD is fast enough that I/O is no longer the bottleneck, so the dequantization overhead and quality loss of 2-bit quants are not worth paying.
* Metal 4's NAX support appears to be a significant generational upgrade, allowing more efficient asynchronous expert loading from flash storage during the forward pass.
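To make the I/O-split idea in the first point concrete, here is a minimal Python sketch of reading one byte range as several concurrent chunk reads. It is not Flash-MoE code: the assumption that --cache-io-split N divides each expert read into N parallel requests is ours, and the names `split_ranges` and `read_split` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(offset, length, parts):
    """Divide [offset, offset + length) into up to `parts` contiguous (offset, size) chunks."""
    if length <= 0:
        return []
    parts = max(1, min(parts, length))
    base, extra = divmod(length, parts)
    ranges, pos = [], offset
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append((pos, size))
        pos += size
    return ranges


def _pread_full(fd, size, offset):
    """os.pread may return fewer bytes than requested, so loop until done or EOF."""
    buf = bytearray()
    while len(buf) < size:
        chunk = os.pread(fd, size - len(buf), offset + len(buf))
        if not chunk:
            break
        buf += chunk
    return bytes(buf)


def read_split(path, offset, length, parts=4):
    """Read `length` bytes at `offset`, issued as `parts` concurrent positional reads.

    Assumption: this mirrors what an I/O-split setting of 4 might do for one expert's weights.
    """
    ranges = split_ranges(offset, length, parts)
    if not ranges:
        return b""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
            # pool.map preserves input order, so the chunks join back in file order.
            chunks = pool.map(lambda r: _pread_full(fd, r[1], r[0]), ranges)
            return b"".join(chunks)
    finally:
        os.close(fd)
```

Timing `read_split` on a large file with `parts` set to 1, 2, 4, and 8 is one way to look for the kind of throughput peak the benchmark describes. The best value would depend on the drive and on OS caching, so results from one machine may not carry over to another.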
DISCOVERED: 2026-03-25
PUBLISHED: 2026-03-25
AUTHOR: Equivalent-Buy1706