OPEN_SOURCE
REDDIT · 3h ago · BENCHMARK RESULT
Qwen3.6: larger quants beat smaller VRAM-fit bets
On a 3070 8GB + 64GB DDR4 setup, the author found that a larger Q4 GGUF ran faster than a smaller Q4, and that Q5_K_S gave the best speed-quality balance. The takeaway is that for this MoE model, the fastest usable quant may not be the smallest one you can fit.
// ANALYSIS
Bigger quants can be the better local-inference choice once you’re memory-constrained, especially on MoE models where stability and runtime behavior matter as much as raw file size.
- The smaller IQ4_XS variant hit looping issues during thinking, while the larger Q4_K_XL reportedly ran faster and more reliably
- Throughput stayed strong even at long context, which suggests the real bottleneck is not just model size but how the quant interacts with the runtime and memory system
- On hybrid CPU/GPU setups, a slightly larger quant can reduce pathological behavior and still improve end-to-end latency
- Q5_K_S looks like the pragmatic pick here: close to the faster Q4 in speed, with better quality and more predictable outputs
- This is a useful reminder for local LLM users to benchmark beyond "fits in VRAM" and test actual tokens/sec plus output stability
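The last point can be sketched as a tiny harness: time a generation call to get tokens/sec, and use repeated n-gram detection as a rough proxy for the looping instability reported with IQ4_XS. The `generate` callable is a placeholder assumption for whatever backend you actually run (llama.cpp bindings, an HTTP endpoint, etc.); the n-gram thresholds are illustrative, not the author's method.

```python
import time

def detect_loop(tokens, ngram=8, repeats=3):
    """Stability proxy: flag output that repeats the same n-gram
    `repeats` or more times (crude looping detector, thresholds
    are illustrative)."""
    counts = {}
    for i in range(len(tokens) - ngram + 1):
        key = tuple(tokens[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= repeats:
            return True
    return False

def bench(generate, prompt, n_tokens=256):
    """Time one generation call and report end-to-end tokens/sec
    plus a loop flag. `generate(prompt, n_tokens)` is assumed to
    return the list of generated tokens."""
    t0 = time.perf_counter()
    out = generate(prompt, n_tokens)
    dt = time.perf_counter() - t0
    return {"tok_per_s": len(out) / dt, "looping": detect_loop(out)}
```

Run this once per quant (same prompt, same context length) and compare both numbers; a quant that "fits in VRAM" but trips the loop flag or lags on tokens/sec loses to a slightly larger one that stays stable.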
// TAGS
qwen3.6-35b-a3b · llm · inference · gpu · benchmark · open-weights
DISCOVERED
3h ago
2026-04-25
PUBLISHED
5h ago
2026-04-24
RELEVANCE
9/10
AUTHOR
jeremynsl