Smaller GGUF quants run slower on Qwen3.6

// 45d ago · BENCHMARK RESULT

A Reddit user running Qwen3.6-35B-A3B in LM Studio and llama.cpp on an RTX 3080 10GB with a Ryzen 5 3600 reports a counterintuitive result: Q4_K_XL is much faster than IQ4_XS at the same settings, even though the IQ4_XS file is smaller. The post asks why a GGUF quant with fewer bits per weight would deliver roughly half the tokens per second, and whether the bottleneck is the quantization format, the GPU offload split, or MoE handling.

// ANALYSIS

Hot take: smaller file size is not the same thing as faster inference, especially when the quant format changes and the workload is a sparse MoE model.

  • IQ4_XS is an i-quant format, which decodes weights through a non-linear lookup table rather than the plain scale-and-offset arithmetic of standard K-quants; that can add dequantization overhead, particularly for layers that run on the CPU, and can hit less-optimized kernels in current llama.cpp builds.
  • Q4_K_XL may simply have better backend support and more efficient matmul paths, so it can outperform a “smaller” quant on real hardware.
  • With a 10GB 3080 and mixed CPU/GPU offload, throughput can be dominated by kernel efficiency, CPU-GPU traffic, and KV cache pressure rather than raw model file size.
  • For sparse MoE models, routing and expert placement can make performance especially non-intuitive; reducing bytes on disk does not guarantee fewer stalls during token generation.
  • The likely fix is to benchmark several quant families rather than only smaller-vs-larger within one family, and to record the exact llama.cpp / LM Studio build, because relative quant speed can change from one version to the next.
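
The bullets above can be made concrete with a rough back-of-the-envelope model. This is a minimal sketch, not a measurement: it assumes token generation is memory-bandwidth-bound, that about 3B parameters are active per token (inferred from the "A3B" in the model name), and it uses made-up placeholder values for memory bandwidth and for each quant's kernel efficiency. `QuantProfile`, `bytes_per_token`, and `est_tokens_per_second` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantProfile:
    """Hypothetical description of one quantized build of a model."""
    name: str
    bits_per_weight: float    # average bits stored per weight (illustrative)
    kernel_efficiency: float  # share of peak memory bandwidth reached, 0..1 (assumed, not measured)


def bytes_per_token(profile: QuantProfile, active_params: float) -> float:
    """Weight bytes that must be read to generate one token."""
    return active_params * profile.bits_per_weight / 8.0


def est_tokens_per_second(profile: QuantProfile, active_params: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound estimate: effective bandwidth divided by bytes per token."""
    if not 0.0 < profile.kernel_efficiency <= 1.0:
        raise ValueError("kernel_efficiency must be in (0, 1]")
    effective = bandwidth_bytes_per_s * profile.kernel_efficiency
    return effective / bytes_per_token(profile, active_params)


# All numbers below are placeholders chosen for illustration only.
ACTIVE_PARAMS = 3e9  # assumption: ~3B active parameters per token
BANDWIDTH = 40e9     # assumption: 40 GB/s of usable memory bandwidth

SMALLER = QuantProfile("smaller-quant", bits_per_weight=4.25, kernel_efficiency=0.35)
LARGER = QuantProfile("larger-quant", bits_per_weight=4.80, kernel_efficiency=0.80)

if __name__ == "__main__":
    for profile in (SMALLER, LARGER):
        rate = est_tokens_per_second(profile, ACTIVE_PARAMS, BANDWIDTH)
        print(f"{profile.name}: {rate:.1f} tokens/s (estimated)")
```

Under these assumed inputs the larger quant reads more bytes per token yet still comes out ahead, because the efficiency term outweighs the size difference. Real efficiency values have to come from benchmarking each quant on the actual hardware and build.
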
// TAGS
qwen, quantization, llama-cpp, lm-studio, moe, local-first, performance

DISCOVERED: 2026-05-06 (45d ago)
PUBLISHED: 2026-05-05 (45d ago)
RELEVANCE: 6/10
AUTHOR: quickreactor