Qwen3.6 MTP tops 249 t/s on RTX 5090M
A Reddit benchmark reports that the unsloth Qwen3.6-35B-A3B-MTP-GGUF UD-Q3_K_XL quant reaches 249.30 tokens/s on a laptop-class RTX 5090M, using llama.cpp master with MTP drafting and spec-draft-n-max set to 3. On the same hardware, the author measures 74.28 tokens/s for the dense 27B variant, and includes a context-length sweep plus VRAM estimates up to 262K context.
This is a benchmark result rather than a model launch. The throughput gap is plausibly explained by MoE sparsity (the A3B suffix conventionally denotes about 3B active parameters per token, against 27B for the dense model) combined with speculative decoding: with a reported 86.6% draft acceptance at n_max=3, most drafted tokens are kept instead of being regenerated. The context sweep suggests the setup remains stable through 262K context, with throughput dropping only slightly at full native length.
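As a rough sanity check on the figures above, the following Python sketch computes the throughput ratio between the two reported numbers and an estimate of tokens produced per verification pass. The function names are hypothetical, and the estimate assumes each drafted token is accepted independently with probability 0.866. That independence model is a simplification introduced here, not something the post states, and the post's 86.6% may be an aggregate accepted/drafted ratio rather than a per-token probability.

```python
def throughput_ratio(fast_tps: float, slow_tps: float) -> float:
    """Ratio between two tokens/s measurements."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


def expected_tokens_per_pass(accept_rate: float, n_max: int) -> float:
    """Estimated tokens emitted per verification pass.

    ASSUMPTION: every drafted token is accepted independently with the
    same probability `accept_rate`. The pass then yields
    1 + a + a^2 + ... + a^n_max tokens on average: the k-th drafted token
    counts only if all k drafts up to it were accepted, and the target
    model always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_max < 0:
        raise ValueError("n_max must be non-negative")
    return sum(accept_rate ** k for k in range(n_max + 1))


if __name__ == "__main__":
    # Figures quoted in the summary: MoE + MTP vs dense 27B, same laptop GPU.
    ratio = throughput_ratio(249.30, 74.28)
    tokens = expected_tokens_per_pass(0.866, 3)
    print(f"throughput ratio: {ratio:.2f}x")
    print(f"estimated tokens per verification pass: {tokens:.2f}")
```

Under these assumptions the ratio comes out at roughly 3.36x and the estimate at about 3.27 tokens per pass. The two numbers are not directly comparable: the ratio compares two different models, so it mixes the effect of MoE sparsity with that of speculative decoding, and it does not isolate either one.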
Discovered: 2026-05-23
Published: 2026-05-23
Author: aurelienams