REDDIT // 3h ago // BENCHMARK RESULT

Qwen3.6 MTP graft speeds llama.cpp

This Hugging Face release grafts Qwen3.6-27B MTP draft heads onto Unsloth UD XL GGUF quantizations and pairs them with an unmerged llama.cpp speculative-decoding patch. The author reports roughly 2.5x token throughput with low VRAM overhead, since only the MTP layers are kept at Q8_0.

// ANALYSIS

This is a strong proof-of-concept for turning a model's built-in MTP training into real local inference speed, not just another speculative-decoding experiment.

– Qwen3.6-27B was trained with 3 MTP steps, so the draft heads are aligned with the base model instead of being a generic proxy drafter.
– Keeping the 3 MTP layers at Q8_0 while the base stays low-bit is a sensible tradeoff: small memory cost, better draft acceptance.
– The main limitation is ecosystem support: the setup still depends on an unmerged llama.cpp PR rather than stock mainline builds.
– If the reported acceptance rate holds across prompts, this is one of the more practical ways to make 27B-class local models feel much faster.
– The release is also useful as a recipe: it includes the grafting script, raw layer extraction, and build steps for reproducing the setup on other models.
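
To make the acceptance-rate point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the release. It assumes each drafted token is accepted independently with the same probability (a simplification; real acceptance varies by prompt and position), that 3 tokens are drafted per step (matching the 3 MTP steps mentioned above), and that every verification pass of the base model contributes one token of its own. The function names and the `draft_cost` parameter are hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int = 3) -> float:
    """Expected tokens emitted per verification pass of the base model.

    Assumption (not from the release): each drafted token is accepted
    independently with probability `accept_rate`, drafting stops at the
    first rejection, and the verification pass always yields one token.
    This is the geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        # Every draft accepted: draft_len tokens plus the verifier's own token.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int = 3,
                      draft_cost: float = 0.0) -> float:
    """Rough speedup over plain decoding.

    `draft_cost` is a hypothetical knob: the cost of producing one draft
    token, expressed as a fraction of one base-model forward pass.
    """
    if draft_cost < 0.0:
        raise ValueError("draft_cost must be non-negative")
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    return tokens / (1.0 + draft_len * draft_cost)


if __name__ == "__main__":
    # Sweep a few illustrative acceptance rates (made-up values, not measurements).
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(estimated_speedup(rate, draft_len=3, draft_cost=0.05), 2))
```

Under these assumptions the ceiling with 3 drafted tokens is 4x, so a reported figure around 2.5x is only plausible if acceptance stays fairly high across prompts, which is exactly the caveat above.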

// TAGS

qwen3.6-27b-mtp-ud-gguf · qwen3.6-27b · llm · open-weights · quantization · inference · benchmark · self-hosted

DISCOVERED

3h ago

2026-05-06

PUBLISHED

3h ago

2026-05-06

RELEVANCE

8/10

AUTHOR

havenoammo