OPEN_SOURCE
REDDIT // BENCHMARK RESULT
M5 Max LLM tests hit 72.8 tok/s
This self-reported benchmark from a new Apple M5 Max 128GB machine shows that local LLM inference is now genuinely usable, with DeepSeek-R1 8B topping the chart at 72.8 tok/s. The most interesting result is that runtime choice matters almost as much as model size: Qwen 3.5 27B runs far faster under MLX than under llama.cpp.
// ANALYSIS
This reads less like a chip brag and more like a preview of how Apple Silicon local AI workflows will actually be built: tiered models, runtime-specific routing, and memory-bandwidth-aware model choice.
- The 614 GB/s unified memory ceiling is clearly driving throughput; results scale roughly inversely with model size, which is exactly what you expect from a bandwidth-bound decode workload.
- MLX's 31.6 tok/s on Qwen 3.5 27B versus llama.cpp's 16.5 tok/s is the headline technical surprise, and it reinforces how much framework-level optimization still matters on Apple Silicon.
- DeepSeek-R1 8B looks like the practical everyday model here: fast enough for interactive use, yet capable enough to keep the assistant feeling smart.
- The 72B result is slow but viable, which makes a semantic router feel less like a hobby project and more like the right product pattern for local AI.
- Because these are self-run benchmarks, the exact numbers will vary with prompt mix, context length, and software revisions, but the overall ranking is believable.
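The bandwidth-bound claim is easy to sanity-check with a back-of-envelope model: during decode, every weight is read roughly once per token, so tok/s is capped near effective bandwidth divided by model footprint. A minimal sketch, assuming illustrative 4-bit-quantized weight sizes (not figures from the post) and an assumed achievable fraction of peak bandwidth:

```python
# Back-of-envelope decode-rate ceiling for a bandwidth-bound workload:
#   tok/s ≈ effective_bandwidth / model_bytes
# Weight footprints below are rough ~4-bit-quantization estimates,
# chosen for illustration only.

PEAK_BW_GBS = 614  # M5 Max unified memory bandwidth, GB/s

MODELS_GB = {
    "DeepSeek-R1 8B": 4.5,   # approx. quantized weights, GB
    "Qwen 3.5 27B": 15.0,
    "72B-class": 40.0,
}

def est_toks(model_gb: float, efficiency: float = 0.6) -> float:
    """Upper-bound decode tok/s, scaled by the assumed fraction of
    peak bandwidth a real runtime actually sustains."""
    return PEAK_BW_GBS * efficiency / model_gb

for name, gb in MODELS_GB.items():
    print(f"{name}: ~{est_toks(gb):.1f} tok/s ceiling")
```

The point is not the exact numbers (the `efficiency` factor is a guess) but the shape: halving the model footprint roughly doubles the ceiling, which matches the ranking in the post.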
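The semantic-router pattern mentioned above can be sketched in a few lines: a cheap heuristic inspects the prompt and picks a model tier before any inference runs. The model names echo the post; the keyword rules here are entirely hypothetical placeholders for a real classifier:

```python
# Hypothetical tiered router: route cheap prompts to the fast 8B
# model, mid-weight tasks to the 27B, and heavy reasoning to the 72B.
# Keyword lists stand in for a real semantic classifier.

def route(prompt: str) -> str:
    """Return the local model tier that should serve this prompt."""
    p = prompt.lower()
    heavy = ("prove", "refactor", "multi-step", "plan")
    medium = ("summarize", "translate", "explain")
    if any(k in p for k in heavy):
        return "72B (slow, highest quality)"
    if any(k in p for k in medium):
        return "Qwen 3.5 27B via MLX"
    return "DeepSeek-R1 8B (interactive default)"
```

In practice the keyword check would be replaced by an embedding similarity or a tiny classifier model, but the control flow is the same: the slow 72B tier only pays its latency cost when the prompt warrants it.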
// TAGS
benchmark · llm · inference · gpu · m5-max
DISCOVERED
2026-03-21
PUBLISHED
2026-03-21
RELEVANCE
8/10
AUTHOR
affenhoden