Qwen3.6-27B MTP Hits 2x on MI50s
On dual AMD Instinct MI50s, a grafted multi-token prediction (MTP) setup on a Qwen3.6-27B GGUF pushed llama.cpp from roughly 26 tok/s to about 40 tok/s on short prompts, and to nearly 48 tok/s when tensor parallelism was combined with MTP. The author notes the gains shrink on long prompts because prefill slows down, but the full coding run still came out close to 2x faster than stock.
This is the kind of benchmark that matters for people trying to keep older ROCm hardware relevant: the speedup is real, but it is workload-sensitive and not free. The headline numbers look great, yet the prefill regression means MTP should be treated as a decode accelerator, not a universal throughput win: the biggest gains show up in short, decode-heavy workloads, while the 18k-token prompt shows the real-world win is smaller than the short-benchmark peak. Tensor parallelism appears to do much of the heavy lifting, with MTP adding a further layer of improvement on top.

Grafting MTP onto an existing Q4_1 quant lowers the barrier for people who already have local GGUF workflows and older AMD cards, and for local AI builders it is a useful sign that llama.cpp's ROCm path is getting more competitive even on aging GPUs like the MI50. The caveat stands, though: because prefill regressed, anyone deploying this should benchmark against their own prompt mix before drawing conclusions.
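The prompt-length sensitivity follows from simple accounting: wall time is prefill time plus decode time, so a decode-only speedup gets diluted as the prompt grows. A minimal sketch of that effect, where the 26 vs 48 tok/s decode rates come from the post but the prefill rates (including the size of the assumed MTP prefill regression) are illustrative guesses, not measurements:

```python
def wall_tps(prompt_toks, out_toks, prefill_tps, decode_tps):
    """Generated tokens per second of wall time, prefill included."""
    return out_toks / (prompt_toks / prefill_tps + out_toks / decode_tps)

# Decode rates (26 vs 48 tok/s) are from the post; the prefill rates,
# and the assumed MTP prefill regression, are illustrative assumptions.
for prompt in (512, 18_000):
    stock = wall_tps(prompt, 1_000, prefill_tps=300, decode_tps=26)
    mtp   = wall_tps(prompt, 1_000, prefill_tps=250, decode_tps=48)
    print(f"{prompt:>6}-token prompt: {mtp / stock:.2f}x end-to-end speedup")
```

With these assumed prefill numbers the model gives about a 1.76x end-to-end win on the 512-token prompt but only about 1.06x on the 18k-token one, matching the direction of the reported results; the exact figures depend entirely on the real prefill rates, which is why benchmarking your own prompt mix matters.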
DISCOVERED
2h ago
2026-05-09
PUBLISHED
4h ago
2026-05-09
RELEVANCE
AUTHOR
legit_split_