Qwen3.6 27B speeds split local users
OPEN_SOURCE
REDDIT // 2h ago · BENCHMARK RESULT


LocalLLaMA users are comparing Qwen3.6-27B throughput after one llama.cpp setup reported about 13 tokens/sec at Q8_0 with 128K context on a mixed pair of RTX 2060 Super and 5060 Ti GPUs. The thread shows a wide spread, from similar llama.cpp results on other consumer cards to much higher numbers on RTX 5090 and vLLM/MTP setups.

// ANALYSIS

The useful signal here is not that one rig is “slow,” but that Qwen3.6-27B makes inference-stack choices brutally visible: context length, quant, KV cache, split mode, and serving engine all matter.
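Each of those knobs corresponds to a llama.cpp server flag. A hypothetical launch command, purely as a sketch: the model filename and the specific values below are assumptions, not the thread's reported configuration, and the flash-attention flag syntax varies across llama.cpp versions.

```shell
# Illustrative llama-server invocation -- filename and values are assumed.
llama-server \
  -m Qwen3.6-27B-Q8_0.gguf \
  -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -fa \
  -sm layer \
  -ngl 99
# -c: 128K context; -ctk/-ctv: Q8_0-quantized KV cache; -fa: flash attention;
# -sm layer: split layers across the two GPUs; -ngl 99: keep all layers in VRAM.
```

Changing any one of these (dropping to a smaller context, switching `-sm layer` to `-sm row`, or letting layers spill to CPU) can move tokens/sec by a large factor, which is why the thread's numbers diverge so much.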

  • Qwen3.6-27B is a 27B open-weight model with vision support and a native 262K context window, so 128K context plus Q8 cache is a heavy local-serving configuration.
  • llama.cpp users are reporting materially different speeds depending on GPU mix, quantization, flash attention, split strategy, and whether the model stays fully in VRAM.
  • vLLM and MTP-backed runs appear much faster in the thread, reinforcing Qwen’s own guidance that production throughput favors engines like vLLM, SGLang, or KTransformers.
  • For developers, the practical takeaway is to benchmark at the actual context length and modality settings they plan to use, not just compare headline tokens/sec.
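A back-of-envelope KV-cache estimate shows why 128K context dominates the serving budget. The layer, head, and dimension numbers below are placeholders, not Qwen3.6-27B's confirmed architecture; the formula is the standard one for a grouped-query-attention cache, ignoring quantization block-scale overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: float) -> float:
    """Size of the K and V caches for one sequence.

    The factor of 2 covers the separate K and V tensors; bytes_per_elem
    is ~2 for an fp16 cache and ~1 for a Q8-quantized cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical architecture numbers -- NOT confirmed for Qwen3.6-27B.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CTX = 131072  # 128K tokens

fp16_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2) / 2**30
q8_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 1) / 2**30
print(f"fp16 cache: {fp16_gib:.0f} GiB, q8 cache: {q8_gib:.0f} GiB")
# With these placeholder numbers: 24 GiB fp16, 12 GiB q8 -- a large
# slice of VRAM before the 27B weights are even counted.
```

Even with the Q8 cache the thread's setup is carrying double-digit gigabytes of cache alone, which is consistent with the reported single-digit tokens/sec on older consumer GPUs.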
// TAGS
qwen3.6-27b · llm · inference · gpu · open-weights · self-hosted · benchmark

DISCOVERED

2h ago

2026-04-22

PUBLISHED

5h ago

2026-04-22

RELEVANCE

8/10

AUTHOR

Ambitious_Fold_2874