OPEN_SOURCE ↗
REDDIT // PRODUCT UPDATE
mlx-lm adds native Qwen3.5 MTP
mlx-lm is adding native multi-token prediction (MTP) support for Qwen3.5 checkpoints, letting Apple Silicon users generate multiple tokens per forward pass. Early benchmarks on a 4-bit Qwen3.5-27B model jump from 15.3 tok/s to 23.3 tok/s, with an 80.6% draft-acceptance rate.
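The reported numbers imply roughly a 1.5x throughput gain; a quick sanity check on the figures from the post:

```python
# Sanity check on the reported benchmark figures (numbers from the post above).
baseline_tps = 15.3  # tok/s without MTP
mtp_tps = 23.3       # tok/s with MTP enabled

speedup = mtp_tps / baseline_tps
print(f"{speedup:.2f}x")  # ≈ 1.52x
```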
// ANALYSIS
This is a real local-inference win, not just a lab trick, because the speedup comes from using Qwen3.5’s built-in MTP head rather than bolting on a separate draft model. The caveat is that the feature is optimized for single-stream generation, so batching and some model variants may not benefit as much.
- Benchmarks on an M4 Pro show about a 1.5x throughput lift, which is the kind of improvement users can feel immediately.
- MTP disables batch serving, so this is best for interactive chats and solo generation rather than high-concurrency workloads.
- Users may need to reconvert checkpoints to preserve `mtp.*` weights before the new path works correctly.
- The PR discussion suggests dense Qwen3.5 checkpoints see better acceptance than MoE variants, so gains may vary by model size.
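The 80.6% acceptance rate connects to the ~1.5x speedup through the standard speculative-decoding expectation: each forward pass commits the verifier's own token plus however many draft tokens are accepted before the first rejection. A minimal sketch of that idealized model (the single-draft-head count and the independence assumption are simplifications, not details from the PR; real speedup is lower because each verification pass costs more than a plain decode step):

```python
def expected_tokens_per_step(p_accept: float, num_draft: int) -> float:
    """Expected tokens committed per forward pass in speculative decoding,
    assuming each of `num_draft` draft tokens is accepted independently with
    probability `p_accept` and acceptance stops at the first rejection.
    The i=0 term is the token the verifier itself always produces."""
    # E = 1 + p + p^2 + ... + p^num_draft  (truncated geometric series)
    return sum(p_accept ** i for i in range(num_draft + 1))

# With the reported 80.6% acceptance and an assumed single MTP draft head:
print(expected_tokens_per_step(0.806, 1))  # ≈ 1.81 tokens per step
```

An idealized 1.81 tokens per step versus the observed 1.52x lift is consistent with per-step overhead from running the MTP head and verifying drafts.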
// TAGS
mlx-lm · qwen3-5 · llm · inference · benchmark · open-source
DISCOVERED
2026-03-21
PUBLISHED
2026-03-21
RELEVANCE
8/10
AUTHOR
be566