OPEN_SOURCE
REDDIT // TUTORIAL
Qwen3.5 GPTQ benchmarks on 4 Radeon GPUs
A Reddit guide shows Qwen3.5-122B-A10B-GPTQ-Int4 running on four Radeon AI PRO R9700s under vLLM on ROCm with a working long-context config. On a real 41k-context task, it reported 34.9 s TTFT, 101.7 s total time, and roughly 4,150 tok/s prompt throughput.
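The reported setup can be sketched as a vLLM serve invocation along these lines. The model name comes from the post; the flag values (tensor-parallel degree, context length, memory utilization) are illustrative assumptions, not the author's exact command:

```shell
# Sketch: serving a GPTQ-Int4 model across four GPUs with vLLM's
# OpenAI-compatible server on ROCm. Values are illustrative assumptions.
vllm serve Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
  --tensor-parallel-size 4 \
  --max-model-len 41000 \
  --gpu-memory-utilization 0.95
```

vLLM normally detects GPTQ quantization from the model's config, so no explicit `--quantization` flag should be needed.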
// ANALYSIS
This is a strong local-inference win for AMD users: GPTQ Int4 plus vLLM ROCm makes a 122B-class model practical, and the prefill speedup over llama.cpp is the real headline. The catch is that this is still a tuning-heavy setup, not a clean victory lap.
- Standard HF weights OOMed, so quantization was the enabler that made the deployment fit at the target context length.
- The author had to disable Qwen’s thinking mode via `chat_template_kwargs`, which matters for latency and output control in real workflows.
- Compared with llama.cpp, the throughput numbers are dramatically better on prefill, but the user still saw worse quality than Q5_K_XL on their task.
- Thermal and power costs are non-trivial: all four GPUs ran hot at 90 °C+, and idle draw was about 90 W per GPU.
- The post also hints that ROCm/MoE support still has room to mature, so future driver and runtime improvements could move the ceiling higher.
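The thinking-mode workaround mentioned above can be sketched as a request payload for vLLM's OpenAI-compatible chat endpoint. The `enable_thinking` switch follows the documented Qwen3 chat-template convention, and the model name is taken from the post; everything else here is an illustrative assumption, not the author's exact config:

```python
# Sketch: building a chat request that disables Qwen's "thinking" mode
# via chat_template_kwargs, which vLLM forwards to the chat template.
import json


def build_request(prompt: str) -> dict:
    """Assemble an OpenAI-style chat payload with thinking disabled."""
    return {
        "model": "Qwen3.5-122B-A10B-GPTQ-Int4",
        "messages": [{"role": "user", "content": prompt}],
        # Forwarded to the Jinja chat template; suppresses <think> blocks.
        "chat_template_kwargs": {"enable_thinking": False},
        "max_tokens": 512,
    }


payload = build_request("Summarize the ROCm release notes.")
print(json.dumps(payload, indent=2))
```

The same payload would be POSTed to the server's `/v1/chat/completions` route (or passed via `extra_body` when using the `openai` client).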
// TAGS
qwen3.5-122b-a10b-gptq-int4 · vllm · rocm · llm · gpu · inference · benchmark · self-hosted
DISCOVERED
2026-03-18
PUBLISHED
2026-03-18
RELEVANCE
8/10
AUTHOR
grunt_monkey_