OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
Qwen3.6 27B MTP hits 54 t/s
A Reddit user reports 29-30 t/s without MTP and 54-55 t/s with am17an's llama.cpp MTP branch on a V100 32GB SXM card, using a q8_0 KV cache and a 200k context limit. The setup served as a VS Code copilot and remained usable even after throughput dropped at longer contexts.
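For reference, here is a minimal sketch of what the non-MTP baseline configuration might look like through llama-cpp-python. This is an assumption-laden illustration, not the poster's actual invocation: the model filename is hypothetical, the MTP decode path lives on am17an's branch and is not exposed by any mainline flag, and a quantized V cache in llama.cpp generally requires flash attention to be enabled.

```python
# Hedged sketch of the *baseline* (non-MTP) setup: q8_0 KV cache + 200k context.
# The MTP speedup itself comes from the branch's decode loop, not these knobs.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-mtp.gguf",  # hypothetical filename
    n_ctx=200_000,     # 200k context limit, as in the report
    n_gpu_layers=-1,   # offload every layer to the V100
    type_k=8,          # GGML_TYPE_Q8_0 for the K cache
    type_v=8,          # GGML_TYPE_Q8_0 for the V cache
    flash_attn=True,   # quantized V cache typically requires flash attention
)

out = llm("// Reverse a linked list in C.\n", max_tokens=128)
print(out["choices"][0]["text"])
```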
// ANALYSIS
This is a strong reminder that speculative-style decoding can matter more than raw quantization tweaks when you're trying to make a 27B model feel interactive on older datacenter GPUs (a toy sketch of the draft-and-verify loop follows the list below).
- The reported jump from ~30 t/s to ~55 t/s is big enough to change the ergonomics of local coding workflows, not just benchmark bragging rights
- The setup still degrades to 40-45 t/s past 50k tokens, so long-context behavior remains a real constraint even when the headline speed looks excellent
- The user saw solid behavior on tool calls, sub-agents, and code review/refactor tasks, which is the right kind of test for a coding model
- This is still a single-user report, so reproducibility will hinge on the branch's maturity, the prompt mix, and whether the MTP GGUF is as stable across workloads as it looks here
- The result also shows V100-era hardware can still be surprisingly competitive for local inference when the stack is tuned well
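To make the mechanism concrete, here is a toy, self-contained sketch of the draft-and-verify loop behind MTP-style speculative decoding: a cheap extra head proposes tokens, the full model verifies them in a single pass, and only the agreeing prefix is kept, so the output is unchanged and only the number of expensive steps shrinks. All names and numbers are illustrative stand-ins, not the branch's implementation.

```python
import random

VOCAB_SIZE = 100

def target_next(ctx):
    # Deterministic stand-in for the expensive 27B forward pass (greedy decode).
    return hash(tuple(ctx)) % VOCAB_SIZE

def draft_next(ctx, accept_rate=0.8):
    # Stand-in for a cheap MTP head: agrees with the target ~80% of the time.
    guess = target_next(ctx)
    return guess if random.random() < accept_rate else (guess + 1) % VOCAB_SIZE

def mtp_decode(ctx, n_new, n_draft=1):
    """Generate at least n_new tokens; return (tokens, expensive steps used)."""
    out, steps, produced = list(ctx), 0, 0
    while produced < n_new:
        # 1) The cheap head drafts n_draft speculative tokens.
        spec, tmp = [], list(out)
        for _ in range(n_draft):
            t = draft_next(tmp)
            spec.append(t)
            tmp.append(t)
        # 2) One expensive pass verifies the drafts and emits one token itself.
        steps += 1
        accepted, tmp = [], list(out)
        for t in spec:
            if target_next(tmp) == t:  # draft matches: a free token
                accepted.append(t)
                tmp.append(t)
            else:
                break                  # first mismatch ends acceptance
        accepted.append(target_next(tmp))  # the target's own token always lands
        out.extend(accepted)
        produced += len(accepted)
    return out, steps

random.seed(0)
toks, steps = mtp_decode([1, 2, 3], n_new=1000)
print(f"tokens per expensive step: {(len(toks) - 3) / steps:.2f}")  # ~1.8
```

Back-of-envelope check under this toy model: if each verify pass costs about the same as a plain decode step, throughput scales as (1 + acceptance rate) per draft head, so the reported 30 to 55 t/s (~1.83x) is consistent with a single head being accepted roughly 80% of the time; coding text, with its boilerplate, tends to draft well.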
// TAGS
llm · open-weights · inference · gpu · long-context · coding-agent · qwen3.6-27b-mtp
DISCOVERED
3h ago
2026-05-06
PUBLISHED
5h ago
2026-05-06
RELEVANCE
8/10
AUTHOR
m94301