Qwen3.6 MTP graft speeds llama.cpp
This Hugging Face release grafts the Qwen3.6-27B MTP (multi-token prediction) draft heads onto Unsloth UD XL GGUF quantizations and pairs them with an unmerged llama.cpp speculative-decoding patch. The author reports roughly 2.5x the token throughput, with little extra VRAM even though the MTP layers are kept at Q8_0.
This is a strong proof of concept for turning a model's built-in MTP training into real local inference speed, not just another speculative-decoding experiment.
- Qwen3.6 was trained with 3 MTP steps, so the draft heads are aligned with the base model instead of being a generic proxy drafter.
- Keeping the 3 MTP layers at Q8_0 while the base stays low-bit is a sensible tradeoff: small memory cost, better draft acceptance.
- The main limiter is ecosystem support, since it still depends on an unmerged llama.cpp PR rather than stock mainline builds.
- If the reported acceptance rate holds across prompts, this is one of the more practical ways to make 27B-class local models feel much faster.
- The release is also useful as a recipe: it includes the grafting script, raw layer extraction, and build steps for reproducing the setup on other models.
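
To make the link between draft acceptance and throughput concrete, here is a minimal sketch of the standard speculative-decoding estimate. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the function name and the acceptance values in the loop are illustrative, not measurements from the release. Only the draft length of 3 comes from the summary above.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the base model.

    Assumption: every drafted token is accepted independently with
    probability accept_prob, and verification stops at the first rejection.
    The base model always contributes one token itself (the correction
    after a rejection, or a bonus token if every draft is accepted), so
    the result lies between 1 and draft_len + 1.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # Geometric sum 1 + a + a^2 + ... + a^k for k drafted tokens.
    return sum(accept_prob ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates; 3 draft tokens matches the 3 MTP steps.
    for accept_prob in (0.5, 0.7, 0.8, 0.9):
        tokens = expected_tokens_per_pass(accept_prob, draft_len=3)
        print(f"accept={accept_prob:.1f} -> {tokens:.3f} tokens per pass")
```

The value is an upper bound on the speedup, because running the draft heads and verifying several tokens at once is not free. Under this model, three draft tokens can never yield more than 4 tokens per pass, so real throughput gains depend heavily on how well the acceptance rate holds across prompts.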
DISCOVERED
2026-05-06
PUBLISHED
2026-05-06
AUTHOR
havenoammo