Qwen3.5 GPTQ benchmarks on 4 Radeon GPUs
OPEN_SOURCE
REDDIT · 24d ago · TUTORIAL


A Reddit guide shows Qwen3.5-122B-A10B-GPTQ-Int4 running on four Radeon AI PRO R9700s under vLLM ROCm with a working long-context config. On a real 41k-context task, it reported 34.9 s TTFT, 101.7 s total time, and roughly 4,150 tok/s prompt throughput.
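A launch along these lines is typical for this kind of deployment. This is a sketch, not the author's verbatim command: the model path, context length, and flag values are assumptions; only "four GPUs, GPTQ-Int4, long context under vLLM ROCm" comes from the post.

```shell
# Hypothetical vLLM serve invocation for a 4-GPU GPTQ-Int4 deployment.
# Flag values are illustrative; the post does not give the exact config.
vllm serve Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
  --tensor-parallel-size 4 \        # shard the model across all four R9700s
  --max-model-len 49152 \           # must cover the ~41k-token task
  --gpu-memory-utilization 0.95     # leave minimal headroom; tune per card
```

The `--tensor-parallel-size 4` split is what lets a 122B-class model fit at all; unquantized HF weights OOMed even so, which is why GPTQ Int4 was the enabler.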

// ANALYSIS

This is a strong local-inference win for AMD users: GPTQ Int4 plus vLLM ROCm makes a 122B-class model practical, and the prefill speedup over llama.cpp is the real headline. The catch is that this is still a tuning-heavy setup, not a clean victory lap.

  • Standard HF weights OOMed, so quantization was the enabler that made the deployment fit at the target context length.
  • The author had to disable Qwen’s thinking mode via `chat_template_kwargs`, which matters for latency and output control in real workflows.
  • Compared with llama.cpp, prefill throughput is dramatically higher, but the author still judged output quality worse than the Q5_K_XL GGUF quant on their task.
  • Thermal and power costs are non-trivial: all four GPUs ran at 90 °C or higher under load, and idle draw was roughly 90 W per GPU.
  • The post also hints that ROCm/MoE support still has room to mature, so future driver and runtime improvements could move the ceiling higher.
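The thinking-mode workaround in the second bullet can be sketched as a request body sent to vLLM's OpenAI-compatible endpoint. The `enable_thinking` key follows Qwen's published chat-template convention; treat the exact kwarg name and model string as assumptions rather than the author's verbatim config.

```python
# Sketch: disabling Qwen's thinking mode per-request via
# chat_template_kwargs on a vLLM OpenAI-compatible server.
# Key name ("enable_thinking") is Qwen's documented convention,
# assumed here, not quoted from the Reddit post.
payload = {
    "model": "Qwen3.5-122B-A10B-GPTQ-Int4",
    "messages": [
        {"role": "user", "content": "Summarize the attached build log."}
    ],
    # Suppress the <think>...</think> preamble, which otherwise adds
    # latency and uncontrolled tokens before the real answer.
    "chat_template_kwargs": {"enable_thinking": False},
}
```

With an OpenAI-style client, the same dict would go through `extra_body` on `chat.completions.create`; skipping thinking output is what makes TTFT and total-time numbers comparable across runs.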
// TAGS
qwen3.5-122b-a10b-gptq-int4 · vllm · rocm · llm · gpu · inference · benchmark · self-hosted

DISCOVERED

2026-03-18

PUBLISHED

2026-03-18

RELEVANCE

8/10

AUTHOR

grunt_monkey_