MTP speeds coding, drags creative output
The author ran 300+ local benchmarks on Qwen3.6-27B with MTP enabled and found the payoff is almost entirely workload-dependent. Coding and factual prompts benefit, while low-quant creative writing can slow down because draft acceptance drops and speculative overhead wins.
Speculative decoding is not a universal speed upgrade; it is a workload-sensitive optimization that only pays when the model can reliably predict the next tokens.
- Coding is the sweet spot: draft acceptance stayed high enough that MTP delivered large gains across every quant tested.
- Creative writing is the failure case on cheaper quants, where lower acceptance rates make the extra draft pass more expensive than the tokens it saves.
- Memory bandwidth still matters, but it is secondary to task predictability here; F16 and Q8_0 win big because they are more bandwidth-bound and can better amortize the draft overhead.
- The finding that `num_speculative_tokens=3` is usually optimal is practical, not just academic: more drafts hurt once acceptance starts sliding.
- The takeaway for local inference is simple: benchmark on your real workload, not on a generic chat loop, because task mix changes the conclusion more than quant size does.
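The acceptance-rate tradeoff in the bullets above can be sketched with a back-of-envelope model (this is an illustration, not the author's benchmark code; the function name, the per-token acceptance probability `a`, the relative draft cost `c`, and the sample values are all assumptions):

```python
def expected_speedup(a: float, k: int, c: float = 0.1) -> float:
    """Estimate speculative-decoding speedup.

    a: probability each draft token is accepted (assumed i.i.d.)
    k: number of speculative draft tokens per step
    c: cost of one draft pass relative to one target-model pass
    """
    # Expected tokens emitted per verification step under geometric
    # acceptance: 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    tokens = (k + 1) if a >= 1 else (1 - a ** (k + 1)) / (1 - a)
    # Relative cost per step: k draft passes plus one verification pass.
    cost = k * c + 1
    return tokens / cost

# High acceptance (coding-like workload): drafting pays off.
print(expected_speedup(a=0.9, k=3, c=0.1))   # ≈ 2.6x
# Low acceptance with pricier drafts (creative writing on cheap
# quants): the overhead wins and throughput drops below baseline.
print(expected_speedup(a=0.3, k=3, c=0.2))   # ≈ 0.89x
```

The same model also reproduces the `num_speculative_tokens=3` observation: once acceptance is low, raising `k` adds draft cost faster than it adds accepted tokens, so the estimated speedup falls further.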
Published 2026-05-10 · Author: ex-arman68