OPEN_SOURCE ↗
REDDIT // 5d ago · BENCHMARK RESULT
Qwen3.5-35B Matches Q4 on MI50s
On dual AMD MI50s, a community benchmark reports Qwen3.5-35B-A3B at Q8_0 hitting 55 tok/s generation and 1100 tok/s prefill, nearly matching a Q4_K_XL run. The post suggests that older AMD hardware and software overhead are flattening the speedup normally expected from heavier quantization.
// ANALYSIS
This reads less like a surprise model win and more like a reminder that local inference performance is often limited by kernels, memory movement, and device topology, not just bit width.
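A rough roofline estimate makes the point concrete. Decode on a modern LLM is usually memory-bound: each generated token must stream all active weights from HBM. The numbers below are assumptions, not from the post: ~3e9 active parameters for the A3B MoE, ~1 TB/s HBM2 bandwidth per MI50, and effective bytes per weight of ~1.06 for Q8_0 versus ~0.56 for 4-bit K-quants (including scale metadata).

```python
# Roofline sketch: ideal decode speed if the only cost per token is
# streaming the active weights from HBM exactly once (memory-bound).
ACTIVE_PARAMS = 3e9          # assumed active params per token (A3B MoE)
BANDWIDTH_BYTES_S = 1.0e12   # assumed MI50 HBM2 bandwidth (~1 TB/s)

def ideal_decode_tok_s(bytes_per_weight: float) -> float:
    """Bandwidth-bound upper limit on generation speed (tok/s)."""
    bytes_per_token = ACTIVE_PARAMS * bytes_per_weight
    return BANDWIDTH_BYTES_S / bytes_per_token

q8 = ideal_decode_tok_s(1.06)   # Q8_0, ~8.5 bits/weight effective
q4 = ideal_decode_tok_s(0.56)   # Q4_K-class, ~4.5 bits/weight effective

print(f"ideal Q8_0 decode:  {q8:.0f} tok/s")
print(f"ideal Q4_K decode:  {q4:.0f} tok/s")
print(f"ideal Q4/Q8 speedup: {q4 / q8:.2f}x")
```

Under these assumptions the bandwidth ceiling sits in the hundreds of tok/s for either quant, far above the observed 55 tok/s, and the ideal ~1.9x quantization speedup should be clearly visible. That neither shows up is consistent with kernel launch overhead, dequantization paths, and inter-GPU traffic dominating, not raw memory bandwidth.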
- Q8_0 keeping pace with Q4_K_XL on generation suggests the bottleneck is not purely arithmetic throughput.
- The prefill jump on two GPUs shows where parallelism still matters: prompt processing benefits far more than token-by-token decoding.
- MI50-era AMD cards are exactly where inference stacks tend to be least polished, so quantization gains can get swallowed by software inefficiency.
- For local model runners, this is a useful reminder to benchmark `prefill` and `decode` separately before choosing a quant level.
- Qwen3.5-35B-A3B still looks attractive for multi-GPU local deployment, especially if you can afford a higher-quality quant without losing real-world speed.
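One way to act on that benchmark-separately advice, sketched with llama.cpp's `llama-bench` tool; the model filenames and layer count here are placeholders, and column names may vary across builds:

```shell
# llama-bench reports prefill (pp) and decode (tg) as separate rows:
# -p sets prompt tokens, -n sets generated tokens, -ngl offloads layers.
# Run once per quant file and compare the pp512 and tg128 rows.
llama-bench -m qwen3.5-35b-a3b-Q8_0.gguf   -ngl 99 -p 512 -n 128
llama-bench -m qwen3.5-35b-a3b-Q4_K_XL.gguf -ngl 99 -p 512 -n 128
```

If the tg128 rows match across quants while pp512 diverges, you are seeing the same overhead-bound decode behavior the post describes, and the higher-quality quant costs you little.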
// TAGS
qwen3.5-35b-a3b · llm · benchmark · gpu · inference · open-source
DISCOVERED
5d ago
2026-04-06
PUBLISHED
5d ago
2026-04-06
RELEVANCE
8/10
AUTHOR
Far-Low-4705