Qwen3 NextN MTP guide for llama.cpp
A hands-on Reddit walkthrough shows how to run Qwen3 NextN MTP speculative decoding in a custom llama.cpp build on a single RTX 3090 Ti. The recipe depends on forked PRs, MTP-aware GGUFs, and careful quantization and KV cache settings to preserve the speedups.
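For orientation, here is a minimal sketch of the shape of that recipe. The fork URL and branch are placeholders (the digest does not name the actual PRs), the build and server flags are standard upstream llama.cpp conventions, and any MTP-specific switches the fork adds are deliberately left out:

```bash
# Hypothetical fork and branch; substitute the PRs the guide actually points to.
git clone https://github.com/<fork>/llama.cpp
cd llama.cpp
git checkout <mtp-branch>

# Standard CUDA build, as upstream llama.cpp documents it.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve an MTP-aware GGUF with the q8 KV cache the post relies on.
# Quantized KV wants flash attention (older builds spell this as plain -fa);
# the fork's MTP/NextN draft flags are not named in the digest, so none
# appear here.
./build/bin/llama-server \
  -m qwen3-nextn-mtp.gguf \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```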
This is the kind of guide power users want and most people will skip: useful, specific, and heavily dependent on a non-upstream branch plus model-file hygiene.
- The real trick is not just enabling speculative decoding; it is keeping the `nextn` tensors intact through quantization so the MTP head still works (see the quantization sketch after this list)
- The performance story is hardware- and tuning-sensitive, with the post's claims tied to a 3090 Ti, power/clock locks, and q8 KV cache settings (the tuning sketch below shows the usual nvidia-smi incantation)
- It is most relevant to people already running local inference stacks and willing to recompile, re-quantize, and troubleshoot edge cases
- The guide also signals where llama.cpp is heading: MTP support looks promising, but the ergonomic path is still fork-first rather than turnkey upstream
- For broader users, this reads more like an advanced operator note than a mainstream release announcement
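On the quantization point: recent llama-quantize builds accept per-tensor type overrides via `--tensor-type`, which is one plausible way to keep the MTP head at high precision while compressing the rest of the model. The `nextn` name pattern below is an assumption about how these tensors are labeled in the GGUF, not a detail confirmed by the post; verify against your file first:

```bash
# Requantize while pinning the nextn/MTP tensors at a high-precision type.
# The "nextn" pattern is an assumption about the tensor names in Qwen3
# NextN GGUFs; check with `gguf-dump model-f16.gguf | grep nextn` first.
./build/bin/llama-quantize \
  --tensor-type nextn=q8_0 \
  model-f16.gguf model-q4_k_m.gguf q4_k_m
```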
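And on the power/clock locks: pinning an NVIDIA GPU for reproducible throughput numbers is typically done with nvidia-smi. The wattage and clock values here are illustrative placeholders, not the post's actual settings:

```bash
# Pin power and core clocks so throughput numbers are reproducible.
# 350 W and 1800 MHz are illustrative values, not the post's settings.
sudo nvidia-smi -pm 1            # enable persistence mode
sudo nvidia-smi -pl 350          # set the power limit (watts)
sudo nvidia-smi -lgc 1800,1800   # lock min,max graphics clocks (MHz)
# Undo the clock lock later with: sudo nvidia-smi -rgc
```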
DISCOVERED: 2026-05-07
PUBLISHED: 2026-05-07
AUTHOR: yes_i_tried_google