OPEN_SOURCE
REDDIT // 14d ago // INFRASTRUCTURE
Qwen3.5-27B Slows Despite Dense Design
A LocalLLaMA user is seeing only ~30 tok/s from Qwen3.5-27B on OpenRouter, even via the fastest listed provider. For a dense 27B model, that is less a surprise than a serving problem: VRAM fit, quantization, batching, and prompt prefill all matter, and the posted TTFT spikes suggest queueing is hurting more than raw decode speed.
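A quick back-of-envelope check shows why queueing, rather than prompt processing, is the likely driver of those TTFT spikes. The prompt length and prefill rate below are illustrative assumptions, not figures from the post:

```python
# Back-of-envelope TTFT decomposition. Numbers are illustrative
# assumptions, not measurements from the OpenRouter post.

def ttft_seconds(queue_wait_s: float, prompt_tokens: int,
                 prefill_tok_per_s: float) -> float:
    """Time to first token = time spent queued + time to prefill the prompt."""
    return queue_wait_s + prompt_tokens / prefill_tok_per_s

# Even a long 8k-token prompt at a modest 2,000 tok/s prefill rate
# accounts for only 4 s of TTFT:
print(ttft_seconds(0.0, 8_000, 2_000.0))  # → 4.0
```

If prefill alone explains only a few seconds, a 30-95 second TTFT implies most of the wait is spent queued behind other requests at the provider.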
// ANALYSIS
30 tok/s is not the real headline here; the 30-95 second TTFT is.
- Qwen3.5-27B is dense, so every generated token uses all 27B parameters. MoE models with far fewer active parameters can be dramatically faster at the same nominal size.
- Qwen’s docs say Qwen3.5 defaults to thinking mode, which can add hidden reasoning work before the visible answer unless the provider disables it.
- OpenRouter’s provider table is routed serving, not a clean single-GPU benchmark, so batching, queue depth, prompt length, and CPU offload can swing throughput a lot.
- TTFT means time to first token, so those long waits include prompt processing and queueing as well as model compute.
- On consumer cards, a 27B model often does not fit comfortably at useful precision, so once layers spill out of VRAM, token speed drops fast.
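The dense-decode point above can be sketched with a memory-bandwidth roofline: each generated token streams every weight through the GPU once, so single-stream decode speed is capped at bandwidth divided by model size. The bandwidth and precision figures here are assumptions for illustration:

```python
# Memory-bandwidth roofline for dense single-stream decode. Each token
# must read all weights once, so decode speed <= bandwidth / weight size.
# Bandwidth and precision figures are illustrative assumptions.

def decode_ceiling_tok_per_s(params_billions: float, bytes_per_param: float,
                             bandwidth_gb_per_s: float) -> float:
    """Upper bound on tok/s when decode is memory-bandwidth bound."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_per_s / weight_gb

# A 27B dense model at 16-bit weights is ~54 GB; on a datacenter card
# with ~3 TB/s of memory bandwidth, the single-stream ceiling is:
print(round(decode_ceiling_tok_per_s(27, 2, 3000), 1))  # → 55.6
```

Against a ~55 tok/s theoretical ceiling at 16-bit precision, observing ~30 tok/s through a batched, routed provider is unremarkable; an MoE model with far fewer active parameters raises that ceiling proportionally.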
// TAGS
qwen3-5-27b · llm · inference · gpu · benchmark · open-weights
DISCOVERED
14d ago
2026-03-29
PUBLISHED
14d ago
2026-03-28
RELEVANCE
8 / 10
AUTHOR
Deep_Row_8729