Qwen3.5 inference speeds spark LocalLLaMA debate
A LocalLLaMA thread asks why Qwen3.5-9B can run nearly as slowly as Qwen3.5-35B-A3B on consumer AMD hardware. The replies argue that the comparison is not apples-to-apples: the 9B run had a much longer prompt, different context pressure, dense rather than sparse activation, and likely VRAM spillover.
This is a good reminder that local LLM speed is usually a systems problem before it is a model-size problem.
- The key reply notes the 9B run used a roughly 4.7k-token prompt versus about 1.6k for the 35B-A3B run, which can heavily distort throughput comparisons
- The “A3B” suffix matters because the 35B mixture-of-experts variant only activates about 3B parameters, so a bigger headline size does not automatically mean slower inference
- Community suggestions point to memory fit as the real bottleneck: once weights or KV cache spill beyond VRAM, a smaller dense model can feel unexpectedly sluggish
- The thread also surfaces practical tuning advice for local serving stacks, including AWQ quantization, vLLM, and enabling MTP where supported
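
The prompt-length and memory-fit points above can be made concrete with a small sketch. It separates prefill from decode throughput so that two runs with different prompt sizes can be compared fairly, and it gives a rough weight-footprint estimate. The `Run` class, the helper names, and every timing below are hypothetical illustrations, not measurements from the thread; only the prompt sizes (about 4.7k and 1.6k tokens) and the parameter counts come from the summary, and the 4-bit quantization level is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run. All timings used below are made-up examples."""
    prompt_tokens: int
    generated_tokens: int
    prefill_seconds: float  # time spent processing the prompt
    decode_seconds: float   # time spent generating new tokens


def prefill_tps(run: Run) -> float:
    """Prompt-processing speed in tokens per second."""
    return run.prompt_tokens / run.prefill_seconds


def decode_tps(run: Run) -> float:
    """Generation speed in tokens per second, independent of prompt length."""
    return run.generated_tokens / run.decode_seconds


def end_to_end_tps(run: Run) -> float:
    """Generated tokens divided by total wall time (prefill + decode).

    This figure drops as the prompt grows, even when decode speed is unchanged.
    """
    return run.generated_tokens / (run.prefill_seconds + run.decode_seconds)


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight footprint in GiB; ignores KV cache and runtime overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


# Hypothetical runs: same decode speed, different prompt lengths.
long_prompt = Run(prompt_tokens=4700, generated_tokens=200,
                  prefill_seconds=9.4, decode_seconds=10.0)
short_prompt = Run(prompt_tokens=1600, generated_tokens=200,
                   prefill_seconds=3.2, decode_seconds=10.0)

for name, run in [("long prompt", long_prompt), ("short prompt", short_prompt)]:
    print(f"{name}: prefill {prefill_tps(run):.0f} tok/s, "
          f"decode {decode_tps(run):.1f} tok/s, "
          f"end-to-end {end_to_end_tps(run):.1f} tok/s")

# Assumed 4-bit quantization. A mixture-of-experts model still has to keep
# all of its weights resident, even if only ~3B are active per token.
print(f"9B dense @ 4-bit:  {weights_gib(9, 4):.1f} GiB")
print(f"35B MoE  @ 4-bit:  {weights_gib(35, 4):.1f} GiB")
```

In this made-up example both runs decode at the same rate, yet the long-prompt run reports a lower end-to-end figure purely because of prefill time. Comparing decode speed alone, at matched prompt lengths, would be a fairer test. The footprint estimate covers weights only; KV cache grows with context length and adds to it.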
DISCOVERED
79d ago
2026-03-11
PUBLISHED
79d ago
2026-03-11
AUTHOR
BitOk4326
