OPEN_SOURCE
REDDIT · 22h ago · BENCHMARK RESULT

Qwen3.6-27B MTP boosts short prompts, drags at 70k+

A user reports a throughput tradeoff when serving Qwen3.6-27B with vLLM on four RTX 3090s: MTP speculative decoding reaches 48-50 tokens/sec on short contexts, but drops to 15-20 tokens/sec once prompts exceed roughly 70-80k tokens. Without MTP, the same setup starts around 30 tokens/sec and only degrades to 26-27 tokens/sec at large context, making the baseline path more attractive for long-context agentic coding workloads.

// ANALYSIS

The hot take is that MTP is helping the easy part of the workload and hurting the hard part; for long-context serving, speculative decoding can become the bottleneck instead of the accelerator.

  • The reported config is aggressive: `--max-model-len 262144`, chunked prefill, prefix caching, and `--speculative-config {"method":"mtp","num_speculative_tokens":3}` together stack pressure on memory bandwidth and add scheduler overhead (see the sketch after this list).
  • The pattern suggests the MTP draft path is paying a larger per-step tax as the KV cache grows, so the speedup vanishes once the prompt gets very long.
  • For agentic coding, where long contexts are routine, a lower speculative token count or disabling MTP entirely is likely the pragmatic choice.
  • The post is useful less as a bug report and more as a serving benchmark: it shows that “faster on short prompts” does not automatically mean “faster in real long-context workloads.”
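The config that produced these numbers maps onto a standard vLLM launch. Below is a minimal sketch of that command: the quoted flags come from the post, while the model identifier, the tensor-parallel size (inferred from the four-GPU setup), the exact flag names for chunked prefill and prefix caching, and the commented mitigation variants are assumptions for illustration, not a confirmed reproduction.

```bash
# Sketch of the reported setup: Qwen3.6-27B served with vLLM across four RTX 3090s.
# The model path below is hypothetical; the speculative-config and max-model-len
# values are reproduced from the post.
vllm serve Qwen/Qwen3.6-27B \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

# Pragmatic variants for long-context agentic workloads (assumptions, not from the post):
#   - shrink the draft so rejected speculations waste fewer full-context passes:
#       --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
#   - or drop speculative decoding entirely by omitting --speculative-config,
#     matching the baseline path that held 26-27 tokens/sec at long context.
```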
// TAGS
qwen · vllm · mtp · speculative-decoding · long-context · local-first · tensor-parallel · agentic-coding

DISCOVERED

22h ago

2026-05-02

PUBLISHED

1d ago

2026-05-02

RELEVANCE

7/10

AUTHOR

niellsro