mlx-lm adds native Qwen3.5 MTP
MLX-LM is adding native multi-token prediction (MTP) support for Qwen3.5 checkpoints, letting Apple Silicon users generate multiple tokens per forward pass. Early benchmarks on a 4-bit Qwen3.5-27B model show throughput rising from 15.3 tok/s to 23.3 tok/s, with an 80.6% acceptance rate.
This is a real local-inference win, not just a lab trick, because the speedup comes from using Qwen3.5’s built-in MTP head rather than bolting on a separate draft model. The caveat is that the feature is optimized for single-stream generation, so batching and some model variants may not benefit as much.
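To put the benchmark figures in context, here is a rough back-of-the-envelope sketch in Python. It assumes one draft token per step (an assumption made here, not something the summary states), so the ideal yield is one guaranteed token plus one draft token accepted at the reported rate. The function names are illustrative only.

```python
# Figures quoted in the benchmark summary.
BASELINE_TPS = 15.3   # tok/s without MTP
MTP_TPS = 23.3        # tok/s with MTP
ACCEPTANCE = 0.806    # reported draft-token acceptance rate


def observed_speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP throughput to baseline throughput."""
    return mtp_tps / baseline_tps


def ideal_tokens_per_pass(acceptance: float, draft_tokens: int = 1) -> float:
    """Expected tokens per verification pass if drafting were free.

    ASSUMPTION: `draft_tokens` drafts per step, each accepted only if all
    earlier drafts were accepted, with a constant per-token acceptance rate.
    """
    return 1.0 + sum(acceptance ** k for k in range(1, draft_tokens + 1))


speedup = observed_speedup(BASELINE_TPS, MTP_TPS)
ideal = ideal_tokens_per_pass(ACCEPTANCE)
efficiency = speedup / ideal  # share of the ideal gain actually realized

print(f"observed speedup: {speedup:.2f}x")      # 1.52x
print(f"ideal tokens/pass: {ideal:.3f}")        # 1.806
print(f"realized fraction: {efficiency:.2f}")   # 0.84
```

Under that assumption, the gap between the ideal 1.8x and the observed 1.5x would be the cost of running the MTP head and verifying drafts; the real breakdown may differ.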
- Benchmarks on an M4 Pro show about a 1.5x throughput lift, which is the kind of improvement users can feel immediately.
- MTP disables batch serving, so this is best for interactive chats and solo generation rather than high-concurrency workloads.
- Users may need to reconvert checkpoints to preserve `mtp.*` weights before the new path works correctly.
- The PR discussion suggests dense Qwen3.5 checkpoints see better acceptance than MoE variants, so gains may vary by model architecture.
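One way to tell whether a converted checkpoint still carries the `mtp.*` weights is to read the tensor names from its safetensors headers. The sketch below uses only the Python standard library and assumes the checkpoint is stored as `*.safetensors` shards in a single directory. The helper names are hypothetical, and the check for a nested `.mtp.` prefix is a guess at how the weights might be namespaced.

```python
import json
import struct
from pathlib import Path


def safetensors_tensor_names(path):
    """Read tensor names from a .safetensors header without loading weights.

    The file starts with an 8-byte little-endian header length,
    followed by a JSON header keyed by tensor name.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]


def find_mtp_weights(model_dir):
    """List tensor names that look like MTP weights across all shards.

    ASSUMPTION: MTP tensors are named `mtp.*`, possibly under a parent
    prefix (hence the `.mtp.` check); adjust for your checkpoint layout.
    """
    names = []
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        names.extend(safetensors_tensor_names(shard))
    return [n for n in names if n.startswith("mtp.") or ".mtp." in n]
```

Calling `find_mtp_weights` on a converted model directory and getting an empty list back would suggest the conversion dropped the MTP head and the checkpoint needs to be reconverted.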
DISCOVERED
2026-03-21
PUBLISHED
2026-03-21
AUTHOR
be566
