Qwen3.5-35B-A3B hits 13.2 t/s on 8GB
OPEN_SOURCE
REDDIT // 25d ago // TUTORIAL


A Reddit tutorial shows how to run Qwen3.5-35B-A3B on an 8GB RTX 5070 Laptop GPU with llama-cli, Vulkan, and Unsloth’s IQ3_XXS GGUF, reaching 13.2 tokens per second on generation. It’s a practical local-inference recipe for squeezing a large open-weight MoE model onto consumer hardware.
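Back-of-envelope arithmetic shows why the quant choice is the whole story. A sketch, assuming IQ3_XXS averages roughly 3 bits per weight (an approximate figure, not from the post):

```python
# Rough weight-memory estimate for a 35B-parameter model at ~3-bit quantization.
# BITS_PER_WEIGHT is an assumption: IQ3_XXS averages close to 3 bits per weight.
BITS_PER_WEIGHT = 3.06
PARAMS = 35e9

model_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9  # bits -> bytes -> GB
print(f"~{model_gb:.1f} GB for weights alone")  # ~13.4 GB
```

Even at ~3 bits the weights alone outgrow an 8GB card, which is why the recipe offloads only part of the model to the GPU and leaves the rest on CPU.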

// ANALYSIS

This is the kind of post that matters more than another benchmark chart: it turns a “too big for laptop” model into something actually usable, but only by leaning hard on quantization, offload tuning, and conservative context settings.

  • The official model card says Qwen3.5-35B-A3B has 35B total parameters, 3B activated, and 262,144 native context, so the model is explicitly built for efficiency rather than brute-force density.
  • The reported speed depends on a narrow setup: IQ3_XXS quantization, `-ngl 18`, 6 threads, Vulkan, and an 8K context. Any of those knobs moving the wrong way will likely drag throughput down fast.
  • The Unsloth imatrix quant is the real unlock here; this is less “35B on 8GB in general” and more “the right quant makes a very specific local deployment possible.”
  • For local-LLM tinkerers, the takeaway is that backend and quant choice matter as much as model choice. On laptops, inference engineering is the product.
  • Community replies already point at the obvious next experiments: let auto-fit handle more of the layer placement, lower context, or accept that smaller Qwen3.5 variants will be much easier to scale.
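The knobs above map onto a llama-cli invocation roughly like this. A sketch, not the poster's exact command: the GGUF filename is illustrative, and Vulkan is selected when llama.cpp is built, not by a runtime flag.

```shell
# Sketch of the described setup; model filename is a placeholder.
# Vulkan comes from building llama.cpp with the Vulkan backend enabled.
llama-cli \
  -m Qwen3.5-35B-A3B-IQ3_XXS.gguf \
  -ngl 18 \
  -t 6 \
  -c 8192 \
  -p "Hello"
```

Here `-ngl 18` offloads 18 layers to the 8GB GPU, `-t 6` runs the remaining layers on 6 CPU threads, and `-c 8192` keeps the KV cache small enough to leave room for those offloaded layers.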
// TAGS
qwen3-5-35b-a3b · llm · inference · gpu · cli · open-weights · self-hosted

DISCOVERED

25d ago (2026-03-17)

PUBLISHED

25d ago (2026-03-17)

RELEVANCE

8 / 10

AUTHOR

zeta-pandey