OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
Qwen3.6 27B MTP hits 54 t/s
A Reddit user reports 29-30 t/s without MTP and 54-55 t/s with am17an's llama.cpp MTP branch on a V100 32GB SXM card, using a q8_0 KV cache and a 200k context limit. The setup served as a VS Code copilot and remained usable even after throughput dropped at longer contexts.
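For reference, here is a minimal sketch of what the non-MTP baseline configuration might look like through llama-cpp-python. This is an assumption-laden illustration, not the poster's actual invocation: the model filename is hypothetical, the MTP decode path lives on am17an's branch and is not exposed by any mainline flag, and a quantized V cache in llama.cpp generally requires flash attention to be enabled.

```python
# Hedged sketch of the *baseline* (non-MTP) setup: q8_0 KV cache + 200k context.
# The MTP speedup itself comes from the branch's decode loop, not these knobs.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-mtp.gguf",  # hypothetical filename
    n_ctx=200_000,     # 200k context limit, as in the report
    n_gpu_layers=-1,   # offload every layer to the V100
    type_k=8,          # GGML_TYPE_Q8_0 for the K cache
    type_v=8,          # GGML_TYPE_Q8_0 for the V cache
    flash_attn=True,   # quantized V cache typically requires flash attention
)

out = llm("// Reverse a linked list in C.\n", max_tokens=128)
print(out["choices"][0]["text"])
```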
// ANALYSIS
This is a strong reminder that speculative-style decoding can matter more than raw quantization tweaks when you're trying to make a 27B model feel interactive on older datacenter GPUs (a toy sketch of the draft-and-verify loop follows the list below).
- The reported jump from ~30 t/s to ~55 t/s is big enough to change the ergonomics of local coding workflows, not just benchmark bragging rights
- The setup still degrades to 40-45 t/s past 50k tokens, so long-context behavior remains a real constraint even when the headline speed looks excellent
- The user saw solid behavior on tool calls, sub-agents, and code review/refactor tasks, which is the right kind of test for a coding model
- This is still a single-user report, so reproducibility will hinge on the branch's maturity, the prompt mix, and whether the MTP GGUF is as stable across workloads as it looks here
- The result also shows V100-era hardware can still be surprisingly competitive for local inference when the stack is tuned well
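To make the mechanism concrete, here is a toy, self-contained sketch of the draft-and-verify loop behind MTP-style speculative decoding: a cheap extra head proposes tokens, the full model verifies them in a single pass, and only the agreeing prefix is kept, so the output is unchanged and only the number of expensive steps shrinks. All names and numbers are illustrative stand-ins, not the branch's implementation.

```python
import random

VOCAB_SIZE = 100

def target_next(ctx):
    # Deterministic stand-in for the expensive 27B forward pass (greedy decode).
    return hash(tuple(ctx)) % VOCAB_SIZE

def draft_next(ctx, accept_rate=0.8):
    # Stand-in for a cheap MTP head: agrees with the target ~80% of the time.
    guess = target_next(ctx)
    return guess if random.random() < accept_rate else (guess + 1) % VOCAB_SIZE

def mtp_decode(ctx, n_new, n_draft=1):
    """Generate at least n_new tokens; return (tokens, expensive steps used)."""
    out, steps, produced = list(ctx), 0, 0
    while produced < n_new:
        # 1) The cheap head drafts n_draft speculative tokens.
        spec, tmp = [], list(out)
        for _ in range(n_draft):
            t = draft_next(tmp)
            spec.append(t)
            tmp.append(t)
        # 2) One expensive pass verifies the drafts and emits one token itself.
        steps += 1
        accepted, tmp = [], list(out)
        for t in spec:
            if target_next(tmp) == t:  # draft matches: a free token
                accepted.append(t)
                tmp.append(t)
            else:
                break                  # first mismatch ends acceptance
        accepted.append(target_next(tmp))  # the target's own token always lands
        out.extend(accepted)
        produced += len(accepted)
    return out, steps

random.seed(0)
toks, steps = mtp_decode([1, 2, 3], n_new=1000)
print(f"tokens per expensive step: {(len(toks) - 3) / steps:.2f}")  # ~1.8
```

Back-of-envelope check under this toy model: if each verify pass costs about the same as a plain decode step, throughput scales as (1 + acceptance rate) per draft head, so the reported 30 to 55 t/s (~1.83x) is consistent with a single head being accepted roughly 80% of the time; coding text, with its boilerplate, tends to draft well.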
// TAGS
llm · open-weights · inference · gpu · long-context · coding-agent · qwen3.6-27b-mtp
DISCOVERED
3h ago
2026-05-06
PUBLISHED
5h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
m94301