Qwen3.5 9B hits 20 tps locally
A Reddit user reports that a reasoning-distilled Qwen3.5 9B model runs at 20 tokens per second on a laptop setup with an RX 580, 8GB of RAM, and swap. The post is less a formal benchmark than a demonstration that aggressive distillation and quantization can make capable local inference accessible on modest hardware.
The interesting part here is not “state of the art” hype but the hardware economics: a 9B reasoning model is usable on commodity, heavily constrained gear when the stack is tuned hard enough.
- Shows how far local inference has moved with smaller, distilled reasoning models and practical quantization
- The reported throughput is meaningful for hobbyist agents, MCP experiments, and offline workflows, even if it is not a controlled benchmark
- The setup highlights the tradeoffs: PCIe x4, USB-attached NVMe, and system swap all signal a fragile but functional performance envelope
- Useful signal for low-VRAM users deciding whether 9B-class models are the sweet spot for local reasoning
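For a rough sense of why quantization matters on low-VRAM hardware, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not a measurement from the post: the parameter count is taken as a nominal 9 billion, the bit widths are illustrative (the post's actual quantization level is not stated here), and the function names are hypothetical. The estimate covers weights only and ignores KV cache, activations, and the per-block scale overhead that real quantized formats add.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to emit n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


N_PARAMS = 9e9  # assumption: nominal "9B"; the exact count is not given

# Illustrative bit widths, not the configuration used in the post.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_footprint_gib(N_PARAMS, bits):.1f} GiB of weights")

# At the reported 20 tokens per second, a 1000-token answer takes:
print(f"{generation_seconds(1000, 20.0):.0f} s")
```

Under these assumptions the weights shrink from roughly 17 GiB at 16 bits to roughly 4 GiB at 4 bits, which is the difference between not fitting at all and fitting on a consumer GPU with room left for context.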
DISCOVERED: 2026-03-31 (58d ago)
PUBLISHED: 2026-03-31 (58d ago)
AUTHOR: ItzYaBoiGoogle