REDDIT // 4h ago // BENCHMARK RESULT

Qwen3.6-27B MTP tuning hits 50 t/s

A Reddit benchmark shows Qwen3.6-27B reaching about 50 tokens/sec on a 3090 when RDson's MTP GGUF is run on a tuned llama.cpp build from am17an. The recipe leans on `--spec-type mtp`, `--spec-draft-n-max 2`, flash attention, and a 100k context cap.

// ANALYSIS

This is a deployment win, not just a model win: the same 27B checkpoint becomes much more usable once MTP survives quantization and the inference stack actually exploits it.

  • The speedup depends on an MTP-aware GGUF; plain GGUF conversions usually give up that headroom.
  • The posted config suggests 100k context is the practical sweet spot on a 3090, not maximum possible context.
  • `--spec-draft-n-max 2` looks like the stable setting here; draft 3 was reportedly too heavy at higher context.
  • `q4_0` KV cache plus large batch and ubatch sizes show this is tuned for throughput, not minimum memory.
  • The broader lesson is that long-context local agents may be better served by "enough" context plus compaction than by brute-forcing huge windows.
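Pulling the posted settings together, a launch command might look like the sketch below. The `--spec-type` and `--spec-draft-n-max` flags are the MTP options named in the post and presumably only exist on am17an's branch; the model path, `-ngl`, and the `-b`/`-ub` values are illustrative placeholders, since the post only says "large batch and ubatch sizes":

```shell
# Sketch of the tuned launch, assuming am17an's llama.cpp branch.
# Model path and -ngl / -b / -ub values are placeholders, not from the post.
./llama-server \
  -m qwen3.6-27b-mtp.gguf \
  -ngl 99 \
  -c 100000 \
  -fa \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -b 2048 -ub 2048 \
  --spec-type mtp \
  --spec-draft-n-max 2
```

One detail worth noting: stock llama.cpp requires flash attention (`-fa`) when the V cache is quantized, which is consistent with the posted config pairing a `q4_0` KV cache with flash attention.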
// TAGS
llm · quantization · inference · gpu · long-context · open-source · qwen3-6-27b

DISCOVERED

2026-05-07

PUBLISHED

2026-05-06

RELEVANCE

9/10

AUTHOR

admajic