Qwen3.6 MTP graft speeds llama.cpp
OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT

This Hugging Face release grafts Qwen3.6-27B MTP draft heads onto Unsloth UD XL GGUF quantizations and pairs them with an unmerged llama.cpp speculative-decoding patch. The author reports roughly 2.5x token throughput, with only a small VRAM overhead because the MTP layers are kept at Q8.
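Because the speculative-decoding path for MTP heads lives in an unmerged PR, there is no stock invocation to quote. As a rough sketch, modeled on mainline llama.cpp's `llama-speculative` example (base model via `-m`, draft model via `-md`), the setup might look like the following; the filenames and the exact flag names are assumptions, not the release's documented recipe:

```shell
# Sketch only: flags follow mainline's llama-speculative example and may
# differ in the patched build; both GGUF filenames are hypothetical.
# Base model: low-bit UD XL quant; draft: the extracted MTP layers at Q8_0.
./llama-speculative \
  -m qwen3.6-27b-UD-Q3_K_XL.gguf \
  -md qwen3.6-27b-mtp-draft-q8_0.gguf \
  --draft-max 3 \
  -p "Explain speculative decoding in one paragraph."
```

The `--draft-max 3` choice mirrors the three MTP steps the model was trained with; drafting further than the trained horizon would likely just lower acceptance.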

// ANALYSIS

This is a strong proof-of-concept for turning a model's built-in MTP training into real local inference speed, not just another speculative-decoding experiment.

  • Qwen3.6 was trained with 3 MTP steps, so the draft heads are aligned with the base model instead of being a generic proxy drafter.
  • Keeping the 3 MTP layers at Q8_0 while the base stays low-bit is a sensible tradeoff: small memory cost, better draft acceptance.
  • The main limiter is ecosystem support, since it still depends on an unmerged llama.cpp PR rather than stock mainline builds.
  • If the reported acceptance rate holds across prompts, this is one of the more practical ways to make 27B-class local models feel much faster.
  • The release is also useful as a recipe: it includes the grafting script, raw layer extraction, and build steps for reproducing the setup on other models.
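A back-of-the-envelope model shows why acceptance rate is the number that matters. If each of the k drafted tokens is accepted independently with probability p, a verification pass yields one guaranteed token plus the accepted draft prefix, so expected throughput per base-model pass is 1 + p + p^2 + ... + p^k. The sketch below (the acceptance model and the 0.7 figure are illustrative assumptions, not numbers from the release, and draft-head compute cost is ignored):

```python
def expected_speedup(p: float, k: int) -> float:
    """Naive speculative-decoding speedup model.

    p: probability each drafted token is accepted (i-th draft token
       survives only if the whole prefix before it matched, ~ p**i).
    k: number of draft tokens per pass (the trained MTP steps).
    """
    expected_accepted = sum(p**i for i in range(1, k + 1))
    # Each base-model verification pass always emits one token itself,
    # plus however many draft tokens were accepted.
    return 1.0 + expected_accepted

# With k=3 MTP steps, a ~70% per-token acceptance rate lands this toy
# model near the reported ~2.5x throughput figure.
print(round(expected_speedup(0.7, 3), 2))  # → 2.53
```

This is why keeping the Q8_0 draft layers' acceptance high is worth the extra memory: at p = 0.5 the same formula gives only ~1.9x.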
// TAGS
qwen3.6-27b-mtp-ud-gguf · qwen3.6-27b · llm · open-weights · quantization · inference · benchmark · self-hosted

DISCOVERED

3h ago

2026-05-06

PUBLISHED

3h ago

2026-05-06

RELEVANCE

8/10

AUTHOR

havenoammo