OPEN_SOURCE
REDDIT · 22d ago · TUTORIAL
Qwen3.5-35B-A3B strains 16GB Radeon rigs
A LocalLLaMA user is trying to run a Huihui Qwen3.5-35B-A3B GGUF in llama.cpp on a 16GB RX 9070 XT and wants to know whether a larger quant is realistic. The thread is really about where the quality-vs-VRAM sweet spot lands for a sparse MoE model once context, KV cache, and AMD backend quirks enter the picture.
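For orientation, the settings under discussion map onto standard llama.cpp flags roughly as follows. This is a sketch, not the poster's exact command: the model filename, context size, and GPU-layer count are placeholders, and flag spellings are the current `llama-cli` ones.

```shell
# Hypothetical invocation; model path, -c, and -ngl values are placeholders.
# -ngl 99: try to offload all layers to the GPU (lower this if you hit OOM)
# -c 16384: context length; KV-cache memory grows linearly with this
llama-cli -m ./Qwen3.5-35B-A3B-IQ4_XS.gguf \
  -ngl 99 -c 16384 \
  --temp 0.7 --top-p 0.8 --top-k 20
```

On a 16GB card, `-ngl` and `-c` are the two knobs that trade model layers against cache headroom.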
// ANALYSIS
This is the classic local-LLM tradeoff in miniature: Qwen3.5 looks small on paper because only a handful of experts are active per token, but the full model, cache, and long-context ambitions still make 16GB a tight fit.
- The official model card recommends `temperature=0.7`, `top_p=0.8`, `top_k=20` for non-thinking general use, which is close to what the poster is already using
- Qwen's docs describe Qwen3.5 as a 36B-parameter sparse MoE with 262k native context, so memory pressure is driven by more than just the base quant
- Community replies suggest `Q5_K_M` or `Q6_K` may be more reliable than `IQ4_XS` on AMD, but the right choice depends on how much headroom you leave for the KV cache
- For the poster's stated use case (docs and email help), a conservative quant is probably safer than chasing a bigger file and starving the GPU
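The headroom tradeoff above can be sketched numerically. The estimator below is a back-of-the-envelope sketch: the bits-per-weight figures are typical for those GGUF quant types, and the layer/head geometry is an assumption borrowed from similar Qwen MoE models, not confirmed numbers for this one.

```python
# Rough VRAM estimator: quantized weights plus f16 KV cache.
# All architecture numbers below are illustrative assumptions.

GIB = 1024**3

def model_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of the quantized weights."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """K and V caches at f16 (2 bytes/element) across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / GIB

N_PARAMS = 36e9  # total params; a sparse MoE stores ALL experts in memory
QUANTS = {"IQ4_XS": 4.25, "Q5_K_M": 5.5, "Q6_K": 6.56}  # typical bpw values

for name, bpw in QUANTS.items():
    weights = model_gib(N_PARAMS, bpw)
    # assumed geometry: 48 layers, 4 KV heads, head_dim 128, 32k context
    kv = kv_cache_gib(48, 4, 128, 32768)
    print(f"{name}: weights ~{weights:.1f} GiB, +KV(32k) ~{kv:.1f} GiB")
```

Under these assumptions even `IQ4_XS` of a ~36B model exceeds 16 GiB before the KV cache, which is why the thread's real question is how many layers to keep on the GPU rather than which quant fits entirely.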
// TAGS
qwen3.5-35b-a3b · llm · inference · gpu · self-hosted · open-source
DISCOVERED
2026-03-21
PUBLISHED
2026-03-21
RELEVANCE
7/10
AUTHOR
uber-linny