M5 Max LLM tests hit 72.8 tok/s
This self-reported benchmark from a new Apple M5 Max 128GB machine shows local LLM inference is now very usable, with DeepSeek-R1 8B topping the chart at 72.8 tok/s. The most interesting result is that runtime choice matters almost as much as model size, with Qwen 3.5 27B running far faster in MLX than in llama.cpp.
This reads less like a chip brag and more like a preview of how Apple Silicon local AI workflows will actually be built: tiered models, runtime-specific routing, and memory-bandwidth-aware model choice.
- The 614 GB/s unified memory bandwidth appears to set the throughput ceiling: tok/s falls as model size grows, which is the pattern you expect from a bandwidth-bound workload.
- MLX’s 31.6 tok/s on Qwen 3.5 27B versus llama.cpp’s 16.5 tok/s is the headline technical surprise, a roughly 1.9x gap on the same model that reinforces how much framework optimization still matters on Apple Silicon.
- DeepSeek-R1 8B looks like the practical everyday model here: fast enough for interactive use, but still capable enough to keep the assistant feeling smart.
- The 72B result is slow but viable, which makes a semantic router, one that sends only the prompts that need it to the largest model, feel less like a hobby project and more like the right product pattern for local AI.
- Because these are self-run benchmarks, the exact numbers will vary with prompt mix, context length, and software revisions, but the overall ranking is plausible.
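To make the bandwidth argument concrete, here is a minimal Python sketch of the usual back-of-envelope estimate: if decoding reads every weight once per token, the ceiling is roughly memory bandwidth divided by the size of the weights. The post does not state the quantization used, so the 4-bit figure below is an assumption, as is treating both models as dense; the estimate also ignores KV-cache and activation traffic. The `Tier` class and `pick_model` helper are hypothetical names for illustration, and `pick_model` is a simple speed-floor picker rather than the semantic router the summary mentions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

BANDWIDTH_GB_S = 614.0  # unified memory bandwidth quoted in the post


def ceiling_tok_s(params_billion: float, bits_per_weight: float,
                  bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Rough decode ceiling: bandwidth / GB of weights read per token."""
    weights_gb = params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / weights_gb


@dataclass(frozen=True)
class Tier:
    name: str
    params_billion: float
    measured_tok_s: float  # self-reported numbers from the post


# Only results with a stated tok/s figure are listed; the 72B number is not given.
TIERS = [
    Tier("DeepSeek-R1 8B", 8, 72.8),
    Tier("Qwen 3.5 27B (MLX)", 27, 31.6),
    Tier("Qwen 3.5 27B (llama.cpp)", 27, 16.5),
]


def pick_model(tiers: Sequence[Tier], min_tok_s: float) -> Optional[Tier]:
    """Pick the largest model that meets a speed floor; break ties by speed."""
    fast_enough = [t for t in tiers if t.measured_tok_s >= min_tok_s]
    if not fast_enough:
        return None
    return max(fast_enough, key=lambda t: (t.params_billion, t.measured_tok_s))


if __name__ == "__main__":
    for t in TIERS:
        cap = ceiling_tok_s(t.params_billion, bits_per_weight=4)  # 4-bit is assumed
        share = t.measured_tok_s / cap
        print(f"{t.name}: {t.measured_tok_s} tok/s measured, "
              f"{cap:.1f} tok/s ceiling, {share:.0%} of ceiling")
    print(pick_model(TIERS, min_tok_s=20.0))
```

Under the assumed 4-bit weights, the same 27B model sits at very different fractions of its ceiling depending on the runtime, which is one way to read the MLX versus llama.cpp gap. A real router would classify the prompt first and then apply a floor like this per tier.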
DISCOVERED: 2026-03-21 (67d ago)
PUBLISHED: 2026-03-21 (68d ago)
AUTHOR: affenhoden