Qwen3.6-27B MTP Hits 2x on MI50s
On dual AMD Instinct MI50s, a grafted multi-token prediction (MTP) setup on a Qwen3.6-27B GGUF pushed llama.cpp from roughly 26 tok/s to about 40 tok/s on short prompts, and to nearly 48 tok/s when tensor parallelism was combined with MTP. The author notes the gains shrink on long prompts because prefill slows down, but the full coding run still came out close to 2x faster than stock.
This is the kind of benchmark that matters for people trying to keep older ROCm hardware relevant: the speedup is real, but it is workload-sensitive and not free. The headline numbers look great, yet the prefill regression means MTP should be treated as a decode accelerator, not a universal throughput win: the biggest gains show up in short, decode-heavy workloads, while the 18k-token prompt shows the real-world win is smaller than the short-benchmark peak. Tensor parallelism appears to do much of the heavy lifting, with MTP adding a further layer of improvement on top.

Grafting MTP onto an existing Q4_1 quant lowers the barrier for people who already have local GGUF workflows and older AMD cards, and for local AI builders it is a useful sign that llama.cpp's ROCm path is getting more competitive even on aging GPUs like the MI50. The caveat stands, though: because prefill regressed, anyone deploying this should benchmark against their own prompt mix before drawing conclusions.
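The prompt-length sensitivity follows from simple accounting: wall time is prefill time plus decode time, so a decode-only speedup gets diluted as the prompt grows. A minimal sketch of that effect, where the 26 vs 48 tok/s decode rates come from the post but the prefill rates (including the size of the assumed MTP prefill regression) are illustrative guesses, not measurements:

```python
def wall_tps(prompt_toks, out_toks, prefill_tps, decode_tps):
    """Generated tokens per second of wall time, prefill included."""
    return out_toks / (prompt_toks / prefill_tps + out_toks / decode_tps)

# Decode rates (26 vs 48 tok/s) are from the post; the prefill rates,
# and the assumed MTP prefill regression, are illustrative assumptions.
for prompt in (512, 18_000):
    stock = wall_tps(prompt, 1_000, prefill_tps=300, decode_tps=26)
    mtp   = wall_tps(prompt, 1_000, prefill_tps=250, decode_tps=48)
    print(f"{prompt:>6}-token prompt: {mtp / stock:.2f}x end-to-end speedup")
```

With these assumed prefill numbers the model gives about a 1.76x end-to-end win on the 512-token prompt but only about 1.06x on the 18k-token one, matching the direction of the reported results; the exact figures depend entirely on the real prefill rates, which is why benchmarking your own prompt mix matters.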
DISCOVERED
2h ago
2026-05-09
PUBLISHED
4h ago
2026-05-09
RELEVANCE
AUTHOR
legit_split_