OPEN_SOURCE
REDDIT // 2h ago · BENCHMARK RESULT
Qwen3.6-27B speed reports split local users
LocalLLaMA users are comparing Qwen3.6-27B throughput after one llama.cpp user reported about 13 tokens/sec at Q8_0 with 128K context across a mixed pair of RTX 2060 Super and 5060 Ti GPUs. The thread shows a wide spread, from similar llama.cpp results on other consumer cards to much higher numbers on RTX 5090 and vLLM/MTP setups.
// ANALYSIS
The useful signal here is not that one rig is “slow,” but that Qwen3.6-27B makes inference-stack choices brutally visible: context length, quant, KV cache, split mode, and serving engine all matter.
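To see why context length and cache quant dominate, it helps to put a rough number on the KV cache alone. The architecture figures below (layers, KV heads, head dimension) are illustrative GQA values for a 27B-class model, not Qwen3.6-27B's actual config, and a q8_0 cache is approximated as 1 byte per element:

```shell
# Back-of-envelope KV-cache size at 128K context for a hypothetical
# 27B-class GQA model. All architecture numbers are assumptions.
n_layers=48; n_kv_heads=8; head_dim=128; ctx=131072; bytes_per_elem=1
# Factor of 2 covers the K and V planes across all layers.
kv_bytes=$((2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem))
echo "$((kv_bytes / 1024 / 1024 / 1024)) GiB"
```

Even under these favorable assumptions the cache alone lands in the double-digit-GiB range before weights, which is why 128K on mixed mid-range cards forces splitting and slows generation.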
- Qwen3.6-27B is a 27B open-weight model with vision support and a native 262K context window, so 128K context plus a Q8 KV cache is a heavy local-serving configuration.
- llama.cpp users report materially different speeds depending on GPU mix, quantization, flash attention, split strategy, and whether the model stays fully in VRAM.
- vLLM and MTP-backed runs appear much faster in the thread, reinforcing Qwen’s own guidance that production throughput favors engines like vLLM, SGLang, or KTransformers.
- For developers, the practical takeaway is to benchmark at the actual context length and modality settings they plan to use, not just to compare headline tokens/sec.
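A benchmark along those lines can be sketched with llama.cpp's `llama-bench` tool. This is a sketch, not the thread poster's exact command: the model path is a placeholder, and flag availability varies across llama.cpp builds (check `llama-bench --help` on yours):

```shell
# -p/-n: prompt-processing and generation token counts to measure
# -ngl 99: offload all layers to GPU(s)
# -fa 1: enable flash attention
# -sm row: row split across multiple GPUs (vs. the default layer split)
# -ctk/-ctv q8_0: quantized KV cache, matching the reported setup
./llama-bench -m ./qwen3.6-27b-q8_0.gguf \
  -p 4096 -n 256 -ngl 99 -fa 1 -sm row \
  -ctk q8_0 -ctv q8_0
```

Running the same command with `-fa 0`, `-sm layer`, or an f16 cache isolates each variable the thread argues over, instead of comparing single headline numbers from incomparable rigs.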
// TAGS
qwen3.6-27b · llm · inference · gpu · open-weights · self-hosted · benchmark
DISCOVERED
2h ago
2026-04-22
PUBLISHED
5h ago
2026-04-22
RELEVANCE
8/10
AUTHOR
Ambitious_Fold_2874