mlx-lm adds native Qwen3.5 MTP

// 114d ago · PRODUCT UPDATE

mlx-lm is adding native multi-token prediction (MTP) support for Qwen3.5 checkpoints, letting Apple Silicon users generate multiple tokens per forward pass. Early benchmarks on a 4-bit Qwen3.5-27B model show throughput rising from 15.3 tok/s to 23.3 tok/s, with an 80.6% acceptance rate.
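To put the quoted figures in context, here is a minimal sketch that computes the measured speedup and compares it with an idealized tokens-per-step estimate. The helper names are hypothetical, and the estimate assumes one drafted token per step, accepted at the reported rate, with no verification overhead. The source does not state the draft depth, so treat that part as an illustration only.

```python
# Figures quoted in the benchmark above.
BASELINE_TPS = 15.3   # tok/s without MTP
MTP_TPS = 23.3        # tok/s with MTP
ACCEPTANCE = 0.806    # fraction of drafted tokens accepted


def speedup(baseline: float, accelerated: float) -> float:
    """Throughput ratio of the accelerated run over the baseline."""
    return accelerated / baseline


def ideal_tokens_per_step(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per verification pass.

    ASSUMPTION (not from the source): every drafted token is accepted
    independently with the same probability, and a rejection discards
    all later drafts in that step.
    """
    # One token is always produced; the k-th draft survives with p**k.
    return 1.0 + sum(acceptance_rate ** k for k in range(1, draft_tokens + 1))


print(f"measured speedup: {speedup(BASELINE_TPS, MTP_TPS):.2f}x")          # 1.52x
print(f"ideal tokens per step: {ideal_tokens_per_step(ACCEPTANCE):.3f}")   # 1.806
```

Under these assumptions the ideal figure (about 1.81 tokens per step) sits above the measured 1.52x, which would be consistent with drafting and verification adding some per-step cost.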

// ANALYSIS

This is a real local-inference win, not just a lab trick, because the speedup comes from using Qwen3.5’s built-in MTP head rather than bolting on a separate draft model. The caveat is that the feature is optimized for single-stream generation, so batching and some model variants may not benefit as much.

– Benchmarks on an M4 Pro show about a 1.5x throughput lift, which is the kind of improvement users can feel immediately.
– MTP disables batch serving, so this is best for interactive chats and solo generation rather than high-concurrency workloads.
– Users may need to reconvert checkpoints to preserve `mtp.*` weights before the new path works correctly.
– The PR discussion suggests dense Qwen3.5 checkpoints see better acceptance than MoE variants, so gains may vary by model architecture.
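One way to tell whether a converted checkpoint still carries its `mtp.*` weights is to scan the tensor names. The sketch below is not part of mlx-lm, and its function names are hypothetical. It assumes a sharded checkpoint with a Hugging Face-style `model.safetensors.index.json` containing a `weight_map`, and it assumes MTP tensors can be recognized by an `mtp.` name component. The exact key layout may differ between checkpoints.

```python
import json
from pathlib import Path


def find_mtp_keys(weight_names, marker="mtp."):
    """Return the weight names that look like MTP-head parameters.

    ASSUMPTION: MTP tensors either start with `mtp.` or contain `.mtp.`
    somewhere in their dotted path.
    """
    return sorted(
        name for name in weight_names
        if name.startswith(marker) or f".{marker}" in name
    )


def weight_names_from_index(index_path):
    """Read tensor names from a sharded checkpoint's index file."""
    with Path(index_path).open(encoding="utf-8") as f:
        return list(json.load(f)["weight_map"].keys())


# Hypothetical tensor names, for illustration only.
example = ["model.layers.0.mlp.up_proj.weight", "mtp.fc.weight"]
print(find_mtp_keys(example))  # ['mtp.fc.weight']
```

An empty result would suggest the conversion dropped the MTP head and the checkpoint needs to be reconverted. Single-file checkpoints have no index file, so their tensor names would have to be listed another way.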

// TAGS

mlx-lm, qwen3-5, llm, inference, benchmark, open-source

DISCOVERED

114d ago

2026-03-21

PUBLISHED

114d ago

2026-03-21

RELEVANCE

8/10

AUTHOR

be566
