Qwen3.6-27B MTP boosts Apple silicon speed
A user on an M5 Max MacBook benchmarked Qwen3.6-27B in llama.cpp with OpenWebUI and found that the MTP (multi-token prediction) build gave only a modest gain at first. After tuning speculative decoding with spec-draft-n-max 3 and spec-draft-p-min 0.75, throughput rose to 24.5 tokens per second (tps), and on a coding prompt the MTP variant reached 27.70 tps versus 17.44 tps for the non-MTP model.
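To put the two coding-prompt figures on a common scale, the ratio can be worked out directly. This is a minimal sketch using only the throughput numbers reported above; the function name `speedup` is an illustrative choice, not something from the original post.

```python
# Throughput figures as reported in the post, in tokens per second (tps).
NON_MTP_TPS = 17.44
MTP_TPS = 27.70


def speedup(baseline_tps: float, new_tps: float) -> float:
    """Return the throughput ratio new_tps / baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps


ratio = speedup(NON_MTP_TPS, MTP_TPS)
percent_gain = (ratio - 1.0) * 100.0
print(f"speedup: {ratio:.2f}x ({percent_gain:.0f}% more tokens per second)")
```

On these numbers the MTP build is about 1.59x the baseline, a gain of roughly 59%, which is well short of 2x.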
Hot take: this is a tuning story, not a universal 2x speedup story. On an M5 Max the gains look solid only when drafted tokens are accepted often enough. The initial config was likely too conservative for speculative decoding, so the first MTP result understated the model’s upside. Raising spec-draft-n-max and setting spec-draft-p-min improved throughput materially, which points to draft acceptance as the bottleneck. The coding prompt reached about 95% acceptance, which is why the MTP variant pulled ahead so clearly there. The 27B numbers are the most useful data point: 17.44 tps non-MTP versus 27.70 tps MTP is roughly a 1.59x speedup, a meaningful improvement for local inference on Apple silicon. The takeaway for other users is to benchmark by workload, not just by model name, because prose and coding prompts can behave very differently with MTP.
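Why acceptance matters so much can be illustrated with a simple textbook-style model of speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. Both are simplifying assumptions, not something measured in the post, and the post's "95% acceptance" may be defined differently. The function name and parameters are hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption (not from the post): each drafted token is accepted
    independently with probability accept_rate, and tokens after the
    first rejection are discarded. The verifier always contributes one
    token, so the result is sum_{k=0}^{draft_len} accept_rate**k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** k for k in range(draft_len + 1))


# Compare a high-acceptance workload with a low-acceptance one,
# using a draft length of 3 to match spec-draft-n-max 3.
for rate in (0.95, 0.50):
    tokens = expected_tokens_per_step(rate, draft_len=3)
    print(f"acceptance {rate:.2f}: {tokens:.3f} tokens per verification pass")
```

Under these assumptions, 95% acceptance with three drafted tokens yields about 3.71 tokens per pass, while 50% acceptance yields 1.875. The model ignores the cost of drafting and verification, so it is an upper bound on tokens per pass rather than a prediction of measured tps.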
DISCOVERED: 2026-05-24
PUBLISHED: 2026-05-24
AUTHOR: chimph