OPEN_SOURCE ↗
REDDIT // 5h ago · BENCHMARK RESULT
Qwen3.6-27B tok/s claims miss old CPUs
A user says Qwen3.6-27B runs far slower than the tok/s numbers they see online, even with everything loaded into VRAM on a 3090 Ti. They report about 10 tok/s in llama.cpp and 18-19 tok/s in ik_llama.cpp at 50k context, and ask whether the slowdown really comes from the model's hybrid architecture plus an older i9-9900K, or whether the CPU-bottleneck explanation is overstated.
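Why 50k context hurts so much can be seen with a back-of-envelope model: decode time per token is roughly a fixed weight-read cost plus a term that grows with the KV cache the attention layers must scan. A minimal sketch with made-up constants (not measurements from this thread):

```python
# Toy model of decode speed vs. context length.
# base_ms and kv_ms_per_1k are illustrative constants, not real measurements.

def tok_per_s(context_len, base_ms=40.0, kv_ms_per_1k=1.2):
    """Estimated tokens/sec: fixed per-token cost + cost per 1k tokens of KV cache."""
    ms_per_token = base_ms + kv_ms_per_1k * (context_len / 1000)
    return 1000.0 / ms_per_token

short = tok_per_s(2_000)   # a short-context benchmark run
long = tok_per_s(50_000)   # the user's 50k-context run

print(f"2k ctx: {short:.1f} tok/s, 50k ctx: {long:.1f} tok/s")
```

Even in this crude linear model, the same hardware posts very different tok/s at 2k and 50k context, which is the apples-to-oranges risk when comparing against numbers posted online.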
// ANALYSIS
Hot take: the explanation is directionally plausible, but it is too absolute.
- The official Qwen3.6-27B model card describes a hybrid `Gated DeltaNet + Gated Attention` layout, so it is not a plain dense transformer with a trivial all-GPU decode path.
- `ik_llama.cpp` documents a faster `HAVE_FANCY_SIMD` path tied to AVX-VNNI/AVX-512-style support; Intel’s i9-9900K spec lists AVX2, not AVX-512 or VNNI.
- That makes it believable that an older Coffee Lake CPU can bottleneck hybrid inference, especially in a backend that keeps part of the compute on the host.
- The big comparison trap is context length: 50k context is far harsher than the short-context runs people often post online.
- The higher numbers on Reddit are likely from a different mix of variables: shorter prompts, speculative decoding, different quantizations, newer CPUs, or a backend that avoids the same CPU-side work.
- So this is probably not “gaslighting,” but it is almost certainly an apples-to-oranges benchmark comparison.
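Whether a given CPU exposes the SIMD features in play here can be checked from the `flags` line of `/proc/cpuinfo` on Linux. A sketch; the flag names follow Linux's conventions, and which of them `ik_llama.cpp`'s `HAVE_FANCY_SIMD` path actually requires is an assumption:

```python
# Check the /proc/cpuinfo "flags" field for the SIMD features discussed above.
# Which flags ik_llama.cpp's HAVE_FANCY_SIMD path needs is an assumption here.

FEATURES = ("avx2", "avx512f", "avx512_vnni", "avx_vnni")

def simd_support(flags_line):
    """Given the space-separated flags field, report which features are present."""
    flags = set(flags_line.split())
    return {f: f in flags for f in FEATURES}

def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line from /proc/cpuinfo (Linux only)."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("flags"):
                return line.split(":", 1)[1].strip()
    return ""

# A Coffee Lake chip like the i9-9900K reports AVX2 but no AVX-512/VNNI flags:
coffee_lake = "fpu vme sse sse2 avx avx2 fma bmi1 bmi2"
print(simd_support(coffee_lake))
```

On a live Linux box, `simd_support(read_cpu_flags())` gives the same report for the local CPU; an all-False AVX-512/VNNI row is consistent with landing on the slower code path.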
// TAGS
qwen3-6-27b · qwen · llama.cpp · ik_llama.cpp · benchmark · tok/s · avx2 · avx-vnni · long-context · gpu-inference
DISCOVERED
5h ago
2026-04-30
PUBLISHED
6h ago
2026-04-30
RELEVANCE
9/10
AUTHOR
YourNightmar31