OPEN_SOURCE
REDDIT · 21d ago · PRODUCT UPDATE

mlx-lm adds native Qwen3.5 MTP

mlx-lm is adding native multi-token prediction (MTP) support for Qwen3.5 checkpoints, letting Apple Silicon users generate multiple tokens per forward pass. Early benchmarks on a 4-bit Qwen3.5-27B model show throughput rising from 15.3 tok/s to 23.3 tok/s, with an 80.6% acceptance rate.
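The reported throughput numbers work out to roughly a 1.5x speedup; a quick sanity check using the figures from the post:

```python
# Benchmark figures reported in the post (Qwen3.5-27B, 4-bit, M4 Pro).
baseline_tok_s = 15.3  # tok/s without MTP
mtp_tok_s = 23.3       # tok/s with MTP enabled

speedup = mtp_tok_s / baseline_tok_s
print(f"{speedup:.2f}x")  # prints "1.52x"
```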

// ANALYSIS

This is a real local-inference win, not just a lab trick, because the speedup comes from using Qwen3.5’s built-in MTP head rather than bolting on a separate draft model. The caveat is that the feature is optimized for single-stream generation, so batching and some model variants may not benefit as much.

  • Benchmarks on an M4 Pro show about a 1.5x throughput lift, which is the kind of improvement users can feel immediately.
  • MTP disables batch serving, so this is best for interactive chats and solo generation rather than high-concurrency workloads.
  • Users may need to reconvert checkpoints to preserve `mtp.*` weights before the new path works correctly.
  • The PR discussion suggests dense Qwen3.5 checkpoints see better acceptance than MoE variants, so gains may vary by model architecture.
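The acceptance rate in the bullets above is what drives the speedup: the model's MTP head drafts several tokens ahead, the base model verifies them, and only the agreed-upon prefix is kept, so every verification pass emits at least one correct token. A toy sketch of that accept/verify loop (all functions here are hypothetical illustrations, not the mlx-lm API; the batched verification pass is emulated token by token):

```python
# Conceptual sketch of MTP-style draft-and-verify decoding. Toy code only:
# the "models" are plain functions, not real network forward passes.

def accepted_prefix(draft, verified):
    """Count drafted tokens the base model agrees with, stopping at the
    first mismatch (the standard speculative-decoding acceptance rule)."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

def mtp_decode(base_next, mtp_draft, prompt, steps, k=2):
    """base_next(seq) -> next token from the base model (greedy).
    mtp_draft(seq, k) -> k tokens drafted by the (toy) MTP head."""
    seq = list(prompt)
    passes = 0
    for _ in range(steps):
        draft = mtp_draft(seq, k)
        # A real verification pass scores all k drafted positions plus one
        # bonus token in a single forward; we emulate it sequentially.
        verified, tmp = [], list(seq)
        for _ in range(k + 1):
            t = base_next(tmp)
            verified.append(t)
            tmp.append(t)
        passes += 1
        n = accepted_prefix(draft, verified)
        # Keep the accepted prefix plus one token from the base model itself,
        # so each pass emits at least one correct token.
        seq.extend(verified[: n + 1])
    return seq, passes

# Toy base model: next token is (last + 1) mod 10. With a draft head that
# always agrees, every pass yields k + 1 = 3 tokens.
base_next = lambda seq: (seq[-1] + 1) % 10
perfect_draft = lambda seq, k: [(seq[-1] + 1 + i) % 10 for i in range(k)]
seq, passes = mtp_decode(base_next, perfect_draft, [0], steps=4, k=2)
print(len(seq) - 1, passes)  # prints "12 4": 12 tokens in 4 passes
```

When the draft head is wrong, acceptance stops at the first mismatch and each pass emits fewer tokens, which is why the ~80% acceptance rate reported for dense checkpoints translates into the observed ~1.5x lift rather than the full 3x ceiling.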
// TAGS
mlx-lm · qwen3-5 · llm · inference · benchmark · open-source

DISCOVERED

21d ago

2026-03-21

PUBLISHED

22d ago

2026-03-21

RELEVANCE

8/10

AUTHOR

be566