OPEN_SOURCE
REDDIT // 32d ago · INFRASTRUCTURE
Qwen3.5 inference speeds spark LocalLLaMA debate
A LocalLLaMA thread asks why Qwen3.5-9B can run nearly as slowly as Qwen3.5-35B-A3B on consumer AMD hardware. The replies point to an apples-to-oranges comparison: a much longer prompt, different context pressure, dense-versus-sparse activation behavior, and likely VRAM spillover.
// ANALYSIS
This is a good reminder that local LLM speed is usually a systems problem before it is a model-size problem.
- The key reply notes the 9B run used a roughly 4.7k-token prompt versus about 1.6k for the 35B-A3B run, which heavily distorts throughput comparisons
- The "A3B" suffix matters: the 35B mixture-of-experts variant activates only about 3B parameters per token, so a bigger headline size does not automatically mean slower inference
- Community suggestions point to memory fit as the real bottleneck: once weights or KV cache spill beyond VRAM, a smaller dense model can feel unexpectedly sluggish
- The thread also surfaces practical tuning advice for local serving stacks, including AWQ quantization, vLLM, and enabling multi-token prediction (MTP) where supported
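The prompt-length point can be made concrete. A sketch of a fair reporting harness, with entirely made-up timings: splitting one timed run into time-to-first-token, steady-state decode speed, and end-to-end throughput shows how the same model at the same decode speed looks much slower end to end when the prompt is 4.7k tokens instead of 1.6k.

```python
# Split one timed generation run into the numbers worth comparing.
# All timings below are hypothetical, chosen only to illustrate the effect.

def report(prompt_tokens, gen_tokens, prefill_s, decode_s):
    """Separate prefill (prompt processing) from decode (generation)."""
    return {
        "ttft_s": prefill_s,                            # time to first token
        "decode_tps": gen_tokens / decode_s,            # steady-state speed
        "end_to_end_tps": gen_tokens / (prefill_s + decode_s),
    }

# Same model, same 25 tok/s decode speed, different prompt lengths:
long_run = report(4700, 256, prefill_s=6.0, decode_s=256 / 25)
short_run = report(1600, 256, prefill_s=2.0, decode_s=256 / 25)

print(long_run["end_to_end_tps"])   # ~15.8 tok/s
print(short_run["end_to_end_tps"])  # ~20.9 tok/s: same decode speed,
                                    # ~25% faster end to end
```

Comparing only aggregate tokens-per-second across runs with such different prompts, as in the original thread, mixes two distinct workloads into one misleading number.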
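The memory-fit argument is easy to sanity-check with arithmetic. A back-of-envelope sketch, where the layer count, KV-head count, head dimension, and quantized sizes are illustrative assumptions rather than official Qwen3.5 specs:

```python
# Back-of-envelope check of whether weights + KV cache fit in VRAM.
# All model figures (layers, KV heads, head dim, quantized size) are
# illustrative assumptions, not official Qwen3.5 numbers.

GIB = 1024 ** 3

def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                   bytes_per_elem=2):
    """KV cache size: one K and one V tensor per layer (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem

def fits_in_vram(weights_gib, kv_gib, vram_gib, overhead_gib=1.0):
    """Crude fit test, reserving ~1 GiB for activations and runtime."""
    return weights_gib + kv_gib + overhead_gib <= vram_gib

# Hypothetical 9B dense model, 4.7k-token prompt, 8 GiB consumer GPU:
kv = kv_cache_bytes(layers=36, kv_heads=8, head_dim=128,
                    context_tokens=4700) / GIB        # ~0.65 GiB
fp16_weights = 9e9 * 2 / GIB                          # ~16.8 GiB
awq_weights = 9e9 * 0.5 / GIB                         # ~4.2 GiB at 4-bit

print(fits_in_vram(fp16_weights, kv, 8))  # False: layers spill to system RAM
print(fits_in_vram(awq_weights, kv, 8))   # True: quantization restores headroom
```

Once the fit test fails, layers offloaded to system RAM dominate latency, which is why a "small" dense model can crawl while a larger quantized MoE stays fast.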
// TAGS
qwen3-5 · llm · inference · gpu · open-weights
DISCOVERED
32d ago
2026-03-11
PUBLISHED
32d ago
2026-03-11
RELEVANCE
7/10
AUTHOR
BitOk4326