Qwen3 NextN MTP guide for llama.cpp
A hands-on Reddit walkthrough shows how to run Qwen3 NextN MTP speculative decoding in a custom llama.cpp build on a single RTX 3090 Ti. The recipe depends on forked PRs, MTP-aware GGUFs, and careful quantization and KV cache settings to preserve the speedups.
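For orientation, here is a minimal sketch of the shape of that recipe. The fork URL and branch are placeholders (the digest does not name the actual PRs), the build and server flags are standard upstream llama.cpp conventions, and any MTP-specific switches the fork adds are deliberately left out:

```bash
# Hypothetical fork and branch; substitute the PRs the guide actually points to.
git clone https://github.com/<fork>/llama.cpp
cd llama.cpp
git checkout <mtp-branch>

# Standard CUDA build, as upstream llama.cpp documents it.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve an MTP-aware GGUF with the q8 KV cache the post relies on.
# Quantized KV wants flash attention (older builds spell this as plain -fa);
# the fork's MTP/NextN draft flags are not named in the digest, so none
# appear here.
./build/bin/llama-server \
  -m qwen3-nextn-mtp.gguf \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```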
This is the kind of guide power users want and most people will skip: useful, specific, and heavily dependent on a non-upstream branch plus model-file hygiene.
- The real trick is not just enabling speculative decoding; it is keeping the `nextn` tensors intact through quantization so the MTP head still works (see the quantization sketch after this list)
- The performance story is hardware- and tuning-sensitive, with the post's claims tied to a 3090 Ti, power/clock locks, and q8 KV cache settings (the tuning sketch below shows the usual nvidia-smi incantation)
- It is most relevant to people already running local inference stacks and willing to recompile, re-quantize, and troubleshoot edge cases
- The guide also signals where llama.cpp is heading: MTP support looks promising, but the ergonomic path is still fork-first rather than turnkey upstream
- For broader users, this reads more like an advanced operator note than a mainstream release announcement
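On the quantization point: recent llama-quantize builds accept per-tensor type overrides via `--tensor-type`, which is one plausible way to keep the MTP head at high precision while compressing the rest of the model. The `nextn` name pattern below is an assumption about how these tensors are labeled in the GGUF, not a detail confirmed by the post; verify against your file first:

```bash
# Requantize while pinning the nextn/MTP tensors at a high-precision type.
# The "nextn" pattern is an assumption about the tensor names in Qwen3
# NextN GGUFs; check with `gguf-dump model-f16.gguf | grep nextn` first.
./build/bin/llama-quantize \
  --tensor-type nextn=q8_0 \
  model-f16.gguf model-q4_k_m.gguf q4_k_m
```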
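And on the power/clock locks: pinning an NVIDIA GPU for reproducible throughput numbers is typically done with nvidia-smi. The wattage and clock values here are illustrative placeholders, not the post's actual settings:

```bash
# Pin power and core clocks so throughput numbers are reproducible.
# 350 W and 1800 MHz are illustrative values, not the post's settings.
sudo nvidia-smi -pm 1            # enable persistence mode
sudo nvidia-smi -pl 350          # set the power limit (watts)
sudo nvidia-smi -lgc 1800,1800   # lock min,max graphics clocks (MHz)
# Undo the clock lock later with: sudo nvidia-smi -rgc
```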
DISCOVERED: 2026-05-07
PUBLISHED: 2026-05-07
AUTHOR: yes_i_tried_google