REDDIT · 5h ago · BENCHMARK RESULT · OPEN_SOURCE

Qwen3.6-27B tok/s claims miss old CPUs

A user says Qwen3.6-27B runs far slower than the tok/s numbers they see online, even with everything loaded into VRAM on a 3090 Ti. They report about 10 tok/s in llama.cpp and 18-19 tok/s in ik_llama.cpp at 50k context, then ask whether the slowdown is really caused by the model’s hybrid architecture and an older i9-9900K, or whether the CPU-bottleneck explanation is overstated.

// ANALYSIS

Hot take: the explanation is directionally plausible, but it is too absolute.

  • The official Qwen3.6-27B model card describes a hybrid `Gated DeltaNet + Gated Attention` layout, so it is not a plain dense transformer with a trivial all-GPU decode path (the toy cost model after this list shows why that matters).
  • `ik_llama.cpp` documents a faster `HAVE_FANCY_SIMD` path tied to AVX-VNNI/AVX-512-style support, while Intel's i9-9900K spec lists AVX2 but not AVX-512 or VNNI (see the flag check below).
  • That makes it believable that an older Coffee Lake CPU can bottleneck hybrid inference, especially in a backend that keeps part of the compute on the host.
  • The big comparison trap is context length: decoding at 50k tokens of context is far harsher than the short-context runs people usually post online (see the decode benchmark sketch after this list).
  • The higher numbers on Reddit are likely from a different mix of variables: shorter prompts, speculative decoding, different quantizations, newer CPUs, or a backend that avoids the same CPU-side work.
  • So this is probably not “gaslighting,” but it is almost certainly an apples-to-oranges benchmark comparison.
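
To make the hybrid-architecture point concrete, here is a toy decode-cost model. The layer count and full-attention ratio are invented for illustration and are not Qwen3.6-27B's real configuration; the point is only that full-attention layers pay a per-token cost that grows with context, while DeltaNet-style linear layers pay a roughly constant one.

```python
# Toy cost model: a hypothetical 48-layer hybrid stack where every 4th
# layer is full attention and the rest are DeltaNet-style linear layers.
# All numbers are illustrative, not Qwen3.6-27B's actual config.
def decode_cost_per_token(context_len: int,
                          n_layers: int = 48,
                          full_attn_every: int = 4) -> int:
    cost = 0
    for layer in range(n_layers):
        if layer % full_attn_every == full_attn_every - 1:
            cost += context_len  # attention scans the whole KV cache
        else:
            cost += 1            # O(1) recurrent state update
    return cost

for ctx in (512, 50_000):
    print(f"context {ctx:>6}: relative decode cost {decode_cost_per_token(ctx)}")
```

Even in this toy, per-token decode work at 50k context is roughly two orders of magnitude above the 512-token case, which is why short-context tok/s figures transfer so poorly.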
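
The SIMD side of the explanation is easy to verify directly. A minimal sketch for Linux that reads `/proc/cpuinfo`; the flag names follow the kernel's naming:

```python
# Check whether the host CPU advertises the SIMD features that
# ik_llama.cpp's faster kernels rely on (Linux-only sketch).
FLAGS_OF_INTEREST = ("avx2", "avx512f", "avx512_vnni", "avx_vnni")

def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for flag in FLAGS_OF_INTEREST:
    print(f"{flag:12} {'yes' if flag in flags else 'no'}")
```

On a Coffee Lake part like the i9-9900K this should print yes only for avx2, consistent with ik_llama.cpp falling back to slower kernels for whatever work stays on the host.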
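
Finally, to reproduce the numbers without conflating prefill and decode, something like the following works. This is a sketch assuming the llama-cpp-python binding and its low-level `eval`/`sample` helpers; the model filename, `n_ctx`, and the crude "word " filler are all placeholders:

```python
# Measure pure decode tok/s at a given context depth.
# Assumes llama-cpp-python is installed; the model path is hypothetical.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-27B-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=51_200,       # headroom for the 50k-token test
    n_gpu_layers=-1,    # full offload, matching the original report
    verbose=False,
)

def decode_tok_s(depth: int, gen_tokens: int = 64) -> float:
    tokens = llm.tokenize(("word " * depth).encode())[:depth]
    llm.reset()
    llm.eval(tokens)                    # prefill: deliberately not timed
    start = time.perf_counter()
    for _ in range(gen_tokens):
        tok = llm.sample()              # default sampling settings
        llm.eval([tok])                 # one decode step
    return gen_tokens / (time.perf_counter() - start)

for depth in (512, 50_000):
    print(f"depth {depth:>6}: {decode_tok_s(depth):.1f} tok/s")
```

If the 512-vs-50k gap on a 3090 Ti + i9-9900K box roughly matches the reported 10 tok/s floor, the hybrid-plus-old-CPU explanation gains weight; if not, the bottleneck is somewhere else in the stack.
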
// TAGS
qwen3-6-27b · qwen · llama.cpp · ik_llama.cpp · benchmark · tok/s · avx2 · avx-vnni · long-context · gpu-inference

DISCOVERED

5h ago · 2026-04-30

PUBLISHED

6h ago · 2026-04-30

RELEVANCE

9/10

AUTHOR

YourNightmar31