
exo native MTP boosts Qwen3.6
The first exo contribution adds native multi-token prediction support for Qwen3.6-style MLX checkpoints, enabled by default on macOS unless EXO_NATIVE_MTP_ENABLED=0 is set. The author reports exactness parity against target-greedy decoding plus benchmark wins on 27B and 35B-A3B settings, along with model-card plumbing and generation-stat reporting.
Hot take: this looks like a real systems win, but only where draft overhead and verifier cost stay under control.
- 27B is the clean success case: K=2 and K=3 both land near 2x throughput versus MTP off, with K=2 slightly ahead in the broad sweep.
- 35B-A3B is more fragile: K=1 is the best setting, and higher K gives back the gain as verifier/cache costs dominate.
- Exactness is the important part here: the recorded greedy probes matched target-greedy for the tested settings, so this is not just a speed hack.
- The practical scope is still narrow: single-node only, explicit model-card metadata required, and stateful logits processors are not yet routed through native MTP.
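To make the exactness and K trade-off points concrete, here is a minimal toy sketch of greedy draft-and-verify decoding. It is not exo's implementation: toy_target, toy_draft, greedy_decode, and draft_verify_decode are hypothetical names invented for illustration, the "models" are plain deterministic functions, and in real native MTP the drafts would presumably come from the checkpoint's own MTP heads and be verified in one batched forward pass. The sketch only shows why accepting a drafted token solely when it equals the target's greedy choice keeps the output identical to target-greedy, and why a larger K helps only when drafts are usually accepted.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def toy_target(ctx: Sequence[Token]) -> Token:
    # Hypothetical stand-in for the target model's greedy (argmax) next token.
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: Sequence[Token]) -> Token:
    # Hypothetical cheap drafter: agrees with the target except at every
    # fourth context length, where it is deliberately wrong.
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(target: NextToken, prompt: Sequence[Token], n: int) -> List[Token]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def draft_verify_decode(
    target: NextToken, draft: NextToken, prompt: Sequence[Token], n: int, k: int
) -> Tuple[List[Token], int]:
    # Draft k tokens, then keep the longest prefix the target agrees with.
    # Returns the generated tokens and the number of verifier rounds used.
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        # 1) Draft k tokens autoregressively with the cheap model.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2) Verify. A real verifier would score all k positions in one
        #    forward pass; this toy calls the target once per position.
        rounds += 1
        ctx = list(out)
        accepted = []
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, stop
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all k accepted: one extra token

        # 3) Never emit more than the n tokens that were requested.
        remaining = n - (len(out) - len(prompt))
        out.extend(accepted[:remaining])
    return out[len(prompt):], rounds


if __name__ == "__main__":
    prompt = [3, 1, 4]
    baseline = greedy_decode(toy_target, prompt, 32)
    for k in (1, 2, 3):
        tokens, rounds = draft_verify_decode(toy_target, toy_draft, prompt, 32, k)
        print(f"K={k}: exact={tokens == baseline}, verifier rounds={rounds}")
```

Every emitted token is the target's own greedy choice for the current context, so the output cannot diverge from target-greedy. Each round emits at least one token, so the number of verifier rounds never exceeds the token count, but each round also costs k draft steps plus a wider verification. Once acceptance drops or verification gets expensive, a higher K stops paying for itself, which is one plausible reading of the 35B-A3B result above.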
Discovered: 2026-05-23
Published: 2026-05-23
Author: meaningego