OPEN_SOURCE
REDDIT · BENCHMARK RESULT
Qwen3.6 MTP graft speeds llama.cpp
This Hugging Face release grafts Qwen3.6-27B MTP draft heads onto Unsloth UD XL GGUF quantizations and pairs them with an unmerged llama.cpp speculative-decoding patch. The author reports roughly 2.5x token throughput over standard decoding, with low VRAM overhead since only the small MTP layers are kept at Q8_0.
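For context on the mechanism, here is a minimal sketch of the greedy draft-and-verify loop that MTP-based speculative decoding performs; the function names are illustrative placeholders, not llama.cpp or patch APIs.

```python
# Minimal sketch of a greedy speculative-decoding step: the MTP heads draft a
# few tokens cheaply, the base model verifies them in one batched forward pass,
# and the longest agreeing prefix is accepted.
def speculative_step(tokens, draft_with_mtp_heads, base_forward, n_draft=3):
    # 1. Draft: MTP heads propose n_draft future tokens from the current context.
    draft = draft_with_mtp_heads(tokens, n_draft)

    # 2. Verify: one base-model pass scores the context plus all drafted tokens,
    #    yielding the base model's greedy choice at each drafted position
    #    (len(draft) + 1 predictions, including one past the last draft).
    verified = base_forward(tokens + draft)

    # 3. Accept the longest prefix where draft and base model agree; the first
    #    disagreement is replaced by the base model's own token.
    accepted = []
    for i, tok in enumerate(draft):
        if tok != verified[i]:
            accepted.append(verified[i])
            break
        accepted.append(tok)
    else:
        accepted.append(verified[len(draft)])  # bonus token: all drafts accepted
    return tokens + accepted
```

The base model still decides every emitted token, so the output matches standard greedy decoding; the speedup comes purely from verifying several positions in one batched pass.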
// ANALYSIS
This is a strong proof-of-concept for turning a model's built-in MTP training into real local inference speed, not just another speculative-decoding experiment.
- Qwen3.6 was trained with 3 MTP steps, so the draft heads are aligned with the base model instead of being a generic proxy drafter.
- Keeping the 3 MTP layers at Q8_0 while the base stays low-bit is a sensible tradeoff: small memory cost, better draft acceptance.
- The main limiter is ecosystem support, since it still depends on an unmerged llama.cpp PR rather than stock mainline builds.
- If the reported acceptance rate holds across prompts, this is one of the more practical ways to make 27B-class local models feel much faster (see the back-of-envelope check after this list).
- The release is also useful as a recipe: it includes the grafting script, raw layer extraction, and build steps for reproducing the setup on other models (a hypothetical sketch of the extraction step follows below).
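A quick back-of-envelope check, using assumed numbers rather than anything from the release: with greedy verification, k drafted tokens, and per-token acceptance probability p, each base-model pass yields (1 - p^(k+1)) / (1 - p) tokens on average. Since the MTP heads are cheap relative to the 27B base, tokens-per-pass roughly tracks the throughput multiple, and the reported ~2.5x is consistent with per-token acceptance around 0.7 at 3 draft steps.

```python
# Expected tokens emitted per base-model forward pass for greedy speculative
# decoding with k draft tokens and per-token acceptance probability p.
# The probabilities below are assumed for illustration, not measured values.
def expected_tokens_per_pass(p: float, k: int) -> float:
    # Accepting the first i drafts has probability p**i; each pass also emits
    # one extra token (the correction or the bonus token), giving sum of p**i.
    return sum(p**i for i in range(k + 1))

print(expected_tokens_per_pass(0.70, 3))  # ~2.53, consistent with ~2.5x
print(expected_tokens_per_pass(0.85, 3))  # ~3.19, headroom if acceptance is high
```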
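As for the recipe itself, here is a hypothetical sketch of the extraction step, written against the gguf and safetensors Python packages. The shard filename, the "mtp" tensor-name filter, and the arch string are all assumptions; the release's actual grafting script merges the heads into the quantized base file rather than writing a standalone GGUF.

```python
# Hypothetical sketch: pull MTP draft-head tensors out of a checkpoint shard
# and write them to a GGUF. Names and paths are assumptions for illustration.
import numpy as np
from gguf import GGUFWriter
from safetensors.numpy import load_file

state = load_file("model-mtp-shard.safetensors")  # assumed shard filename

writer = GGUFWriter("mtp_layers.gguf", arch="qwen3")  # arch string is a guess
for name, tensor in state.items():
    if "mtp" in name:  # assumed naming; the real script may match explicit ids
        # The release keeps these at Q8_0; fp16 here keeps the sketch simple.
        writer.add_tensor(name, tensor.astype(np.float16))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```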
// TAGS
qwen3.6-27b-mtp-ud-gguf · qwen3.6-27b · llm · open-weights · quantization · inference · benchmark · self-hosted
DISCOVERED
2026-05-06
PUBLISHED
2026-05-06
RELEVANCE
8/10
AUTHOR
havenoammo