OPEN_SOURCE
REDDIT // 2h ago · BENCHMARK RESULT
Qwen3.6-27B speed reports split local users
LocalLLaMA users are comparing Qwen3.6-27B throughput after one llama.cpp user reported about 13 tokens/sec at Q8_0 with 128K context across a mixed pair of RTX 2060 Super and 5060 Ti GPUs. The thread shows a wide spread, from similar llama.cpp results on other consumer cards to much higher numbers on RTX 5090 and vLLM/MTP setups.
// ANALYSIS
The useful signal here is not that one rig is “slow,” but that Qwen3.6-27B makes inference-stack choices brutally visible: context length, quant, KV cache, split mode, and serving engine all matter.
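To see why context length and cache quant dominate, it helps to put a rough number on the KV cache alone. The architecture figures below (layers, KV heads, head dimension) are illustrative GQA values for a 27B-class model, not Qwen3.6-27B's actual config, and a q8_0 cache is approximated as 1 byte per element:

```shell
# Back-of-envelope KV-cache size at 128K context for a hypothetical
# 27B-class GQA model. All architecture numbers are assumptions.
n_layers=48; n_kv_heads=8; head_dim=128; ctx=131072; bytes_per_elem=1
# Factor of 2 covers the K and V planes across all layers.
kv_bytes=$((2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem))
echo "$((kv_bytes / 1024 / 1024 / 1024)) GiB"
```

Even under these favorable assumptions the cache alone lands in the double-digit-GiB range before weights, which is why 128K on mixed mid-range cards forces splitting and slows generation.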
- Qwen3.6-27B is a 27B open-weight model with vision support and a native 262K context window, so 128K context plus a Q8 KV cache is a heavy local-serving configuration.
- llama.cpp users report materially different speeds depending on GPU mix, quantization, flash attention, split strategy, and whether the model stays fully in VRAM.
- vLLM and MTP-backed runs appear much faster in the thread, reinforcing Qwen’s own guidance that production throughput favors engines like vLLM, SGLang, or KTransformers.
- For developers, the practical takeaway is to benchmark at the actual context length and modality settings they plan to use, not just to compare headline tokens/sec.
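A benchmark along those lines can be sketched with llama.cpp's `llama-bench` tool. This is a sketch, not the thread poster's exact command: the model path is a placeholder, and flag availability varies across llama.cpp builds (check `llama-bench --help` on yours):

```shell
# -p/-n: prompt-processing and generation token counts to measure
# -ngl 99: offload all layers to GPU(s)
# -fa 1: enable flash attention
# -sm row: row split across multiple GPUs (vs. the default layer split)
# -ctk/-ctv q8_0: quantized KV cache, matching the reported setup
./llama-bench -m ./qwen3.6-27b-q8_0.gguf \
  -p 4096 -n 256 -ngl 99 -fa 1 -sm row \
  -ctk q8_0 -ctv q8_0
```

Running the same command with `-fa 0`, `-sm layer`, or an f16 cache isolates each variable the thread argues over, instead of comparing single headline numbers from incomparable rigs.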
// TAGS
qwen3.6-27b · llm · inference · gpu · open-weights · self-hosted · benchmark
DISCOVERED
2h ago
2026-04-22
PUBLISHED
5h ago
2026-04-22
RELEVANCE
8/10
AUTHOR
Ambitious_Fold_2874