Flash-MoE hits 13 tps with Qwen 3.5-397B
REDDIT · 18d ago · BENCHMARK RESULT

A benchmark on a 128GB M5 Max MacBook Pro shows Flash-MoE running Qwen 3.5-397B locally at nearly 13 tokens per second. Using the Anemll fork with Metal 4 optimizations and --cache-io-split 4, the tester reported a 3x gain over prior M3 Max results, and found that 4-bit quantization outperformed the 2-bit variants.
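To put the headline numbers in perspective, here is a small back-of-the-envelope sketch. It assumes the "nearly 13" figure is rounded to 13.0 and that the 3x gain is exact; the helper names are made up for illustration, and the derived M3 Max figure is only implied by those two numbers, not quoted from the post.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average wall-clock time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def implied_baseline(tokens_per_second: float, speedup: float) -> float:
    """Throughput the older machine would need for the stated speedup to hold."""
    return tokens_per_second / speedup


m5_tps = 13.0   # assumption: "nearly 13" rounded up
speedup = 3.0   # assumption: "3x" taken as exact

print(f"{ms_per_token(m5_tps):.1f} ms per token")                # 76.9 ms per token
print(f"{implied_baseline(m5_tps, speedup):.1f} tok/s implied")  # 4.3 tok/s implied
```

At roughly 77 ms per token, output arrives faster than most people read, which is the sense in which the result counts as interactive.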

// ANALYSIS

Local inference for "God-sized" models is now interactive on consumer hardware: a 397B-parameter MoE running at nearly 13 tokens per second on a laptop suggests that unified memory architectures are changing the economics of LLM deployment.

* The performance peak at --cache-io-split 4 suggests that the degree of software read parallelism has to be matched to the internal parallelism of the hardware's SSD controller for maximum throughput.

* 2-bit quantization is a "false economy" on M5 Max; the SSD is fast enough that I/O is no longer the bottleneck, making the dequantization overhead and quality loss of 2-bit quants unacceptable.

* Metal 4's NAX support is a crucial generational upgrade, allowing for much more efficient asynchronous expert loading from flash storage during the forward pass.
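The idea of splitting one large read into several concurrent ones can be sketched as follows. This is an illustrative guess at the kind of technique a setting like --cache-io-split 4 might control, not Flash-MoE's actual implementation; `split_read` and its parameters are hypothetical names, and the demo file stands in for a weights file.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_read(path: str, offset: int, length: int, splits: int = 4) -> bytes:
    """Read `length` bytes starting at `offset` using up to `splits` concurrent reads."""
    chunk = -(-length // splits)  # ceiling division: size of each sub-read
    ranges = [
        (offset + i * chunk, min(chunk, length - i * chunk))
        for i in range(splits)
        if i * chunk < length
    ]
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=splits) as pool:
            # os.pread takes an explicit offset, so threads can share one descriptor.
            # Short reads are possible in general; a production version would loop.
            parts = pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges)
            return b"".join(parts)  # map() preserves the order of `ranges`
    finally:
        os.close(fd)


# Demo on a throwaway file standing in for one expert's weights.
data = bytes(range(256)) * 64
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    demo_path = f.name

blob = split_read(demo_path, offset=100, length=1000, splits=4)
print(blob == data[100:1100])
os.remove(demo_path)
```

Whether four is the best split count depends on the drive; per the post, the peak on the tested M5 Max was at 4, so the value is worth sweeping rather than assuming.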

// TAGS
llm · moe · macbook · m5 max · qwen · benchmark · local ai · apple silicon · flash-moe

DISCOVERED: 18d ago (2026-03-25)

PUBLISHED: 18d ago (2026-03-25)

RELEVANCE: 9/10

AUTHOR: Equivalent-Buy1706