OPEN_SOURCE ↗
REDDIT · 4h ago · BENCHMARK RESULT
Qwen3.6-27B MTP tuning hits 50 t/s
A Reddit benchmark shows Qwen3.6-27B can reach about 50 tokens/sec on a 3090 when run with RDson's MTP-aware GGUF on am17an's tuned llama.cpp build. The recipe leans on `--spec-type mtp`, `--spec-draft-n-max 2`, flash attention, and a 100k context cap.
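Assembled into a single `llama-server` invocation, the reported recipe looks roughly like this (a sketch, not the poster's exact command: the model filename and the batch/ubatch values are assumptions, and the `--spec-type`/`--spec-draft-n-max` flags are as reported in the post and require an MTP-capable llama.cpp build such as am17an's branch):

```shell
# Hedged reconstruction of the reported 3090 config.
# Assumed: model filename, batch sizes. From the post: MTP speculative
# decoding with 2 draft tokens, flash attention, 100k context,
# q4_0-quantized KV cache.
./llama-server \
  -m Qwen3.6-27B-MTP.gguf \
  -c 100000 \
  -fa \
  -ctk q4_0 -ctv q4_0 \
  -b 2048 -ub 2048 \
  --spec-type mtp \
  --spec-draft-n-max 2
```

The q4_0 KV cache flags (`-ctk`/`-ctv`) are what make 100k context fit alongside a 27B checkpoint in 24 GB; flash attention is a prerequisite for quantized KV in llama.cpp.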
// ANALYSIS
This is a deployment win, not just a model win: the same 27B checkpoint becomes much more usable once MTP survives quantization and the inference stack actually exploits it.
- The speedup depends on an MTP-aware GGUF; plain GGUF conversions usually give up that headroom.
- The posted config suggests 100k context is the practical sweet spot on a 3090, not maximum possible context.
- `--spec-draft-n-max 2` looks like the stable setting here; draft 3 was reportedly too heavy at higher context.
- `q4_0` KV cache plus large batch and ubatch sizes show this is tuned for throughput, not minimum memory.
- The broader lesson is that long-context local agents may be better served by "enough" context plus compaction than by brute-forcing huge windows.
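The "draft 2 beats draft 3" observation is a standard speculative-decoding trade-off: each extra draft token raises the expected tokens emitted per verification step, but also adds per-step overhead that grows with context length. A toy analytical model (my assumption, not from the post) shows how a modest overhead makes n=2 the peak:

```python
# Toy model of MTP speculative decoding throughput (illustrative only).
# Assumes each draft token is accepted independently with probability p,
# and drafting n tokens adds a relative overhead of c*n per verify step
# (c rises with context length as KV-cache reads dominate).

def expected_tokens_per_step(n: int, p: float) -> float:
    """Expected tokens emitted per verification step with n draft tokens.

    The accepted prefix length plus the one token the verify pass always
    yields: sum_{k=0}^{n} p^k = (1 - p**(n+1)) / (1 - p).
    """
    return (1 - p ** (n + 1)) / (1 - p)

def relative_throughput(n: int, p: float, c: float = 0.3) -> float:
    """Throughput normalized to non-speculative decoding (= 1.0)."""
    return expected_tokens_per_step(n, p) / (1 + c * n)

# Under these assumed numbers, n=2 peaks and n=3 falls off.
for n in (1, 2, 3):
    print(f"draft n={n}: {relative_throughput(n, p=0.7):.2f}x")
```

With a lower overhead `c` (short contexts, cheap draft head) the optimum shifts toward longer drafts, which is consistent with draft 3 only becoming "too heavy" at higher context.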
// TAGS
llm · quantization · inference · gpu · long-context · open-source · qwen3-6-27b
DISCOVERED
4h ago
2026-05-07
PUBLISHED
4h ago
2026-05-06
RELEVANCE
9/10
AUTHOR
admajic