OPEN_SOURCE ↗
REDDIT // TUTORIAL · 25d ago
Qwen3.5-35B-A3B hits 13.2 t/s on 8GB
A Reddit tutorial shows how to run Qwen3.5-35B-A3B on an 8GB RTX 5070 Laptop GPU with llama-cli, Vulkan, and Unsloth’s IQ3_XXS GGUF, reaching 13.2 tokens per second on generation. It’s a practical local-inference recipe for squeezing a large open-weight MoE model onto consumer hardware.
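Some back-of-envelope arithmetic shows why partial GPU offload is required at all. This is a rough sketch, not from the post: it assumes IQ3_XXS averages about 3.06 bits per weight (the approximate figure for that llama.cpp quant type) and ignores KV-cache and runtime overhead.

```python
# Rough VRAM estimate for the quantized weights alone (assumption:
# IQ3_XXS averages ~3.06 bits per weight; KV cache and overhead ignored).
BITS_PER_WEIGHT = 3.06
params = 35e9  # 35B total parameters, per the model card

model_gb = params * BITS_PER_WEIGHT / 8 / 1e9
print(f"full model at IQ3_XXS: ~{model_gb:.1f} GB")  # ~13.4 GB
# ~13.4 GB of weights cannot fit on an 8 GB card, which is why the post
# offloads only part of the model (-ngl 18) and runs the rest on CPU RAM.
```

Even at roughly 3 bits per weight, the full model overshoots 8 GB by a wide margin, so the split between GPU and CPU layers is doing real work here.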
// ANALYSIS
This is the kind of post that matters more than another benchmark chart: it turns a “too big for laptop” model into something actually usable, but only by leaning hard on quantization, offload tuning, and conservative context settings.
- The official model card says Qwen3.5-35B-A3B has 35B total parameters, 3B activated, and 262,144 native context, so the model is explicitly built for efficiency rather than brute-force density.
- The reported speed depends on a narrow setup: IQ3_XXS quantization, `-ngl 18`, 6 threads, Vulkan, and an 8K context. Any of those knobs moving the wrong way will likely drag throughput down fast.
- The Unsloth imatrix quant is the real unlock here; this is less "35B on 8GB in general" and more "the right quant makes a very specific local deployment possible."
- For local-LLM tinkerers, the takeaway is that backend and quant choice matter as much as model choice. On laptops, inference engineering is the product.
- Community replies already point at the obvious next experiments: let auto-fit handle more of the layer placement, lower context, or accept that smaller Qwen3.5 variants will be much easier to scale.
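The knobs above can be pulled together into a single invocation. This is a sketch reconstructed from the settings the post reports, not the poster's exact command: the GGUF filename is an assumption, and `llama-cli` must come from a llama.cpp build with the Vulkan backend enabled.

```shell
# Sketch of the post's recipe (filename assumed). Build llama.cpp with
# Vulkan first, e.g.:
#   cmake -B build -DGGML_VULKAN=ON && cmake --build build
#
# Flags, matching the reported setup:
#   -ngl 18 : offload 18 layers to the 8GB GPU, keep the rest on CPU
#   -t 6    : 6 CPU threads for the host-side layers
#   -c 8192 : 8K context, far below the model's 262,144 native window
./llama-cli -m Qwen3.5-35B-A3B-IQ3_XXS.gguf -ngl 18 -t 6 -c 8192 \
  -p "Explain MoE routing in two sentences."
```

Raising `-ngl` past what the 8 GB card can hold, or widening `-c`, grows VRAM use and is the most likely way to fall off the reported 13.2 t/s.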
// TAGS
qwen3-5-35b-a3b · llm · inference · gpu · cli · open-weights · self-hosted
DISCOVERED
25d ago
2026-03-17
PUBLISHED
25d ago
2026-03-17
RELEVANCE
8 / 10
AUTHOR
zeta-pandey