YOU ARE VIEWING ONE ITEM FROM THE AICRIER FEED

Qwen3.5 GPTQ benchmarks on 4 Radeon GPUs

AICrier tracks AI developer news across Product Hunt, GitHub, Hacker News, YouTube, X, arXiv, and more. This page keeps the article you opened front and center while giving you a path into the live feed.

// WHAT AICRIER DOES

7+

TRACKED FEEDS

24/7

SCRAPED FEED

Short summaries, external links, screenshots, relevance scoring, tags, and featured picks for AI builders.

Qwen3.5 GPTQ benchmarks on 4 Radeon GPUs
OPEN LINK ↗
// 70d agoTUTORIAL

Qwen3.5 GPTQ benchmarks on 4 Radeon GPUs

A Reddit guide shows Qwen3.5-122B-A10B-GPTQ-Int4 running on four Radeon AI PRO R9700s under vLLM ROCm with a working long-context config. On a real 41k-context task, it reported 34.9 s TTFT, 101.7 s total time, and roughly 4,150 tok/s prompt throughput.

// ANALYSIS

This is a strong local-inference win for AMD users: GPTQ Int4 plus vLLM ROCm makes a 122B-class model practical, and the prefill speedup over llama.cpp is the real headline. The catch is that this is still a tuning-heavy setup, not a clean victory lap.

  • Standard HF weights OOMed, so quantization was the enabler that made the deployment fit at the target context length.
  • The author had to disable Qwen’s thinking mode via `chat_template_kwargs`, which matters for latency and output control in real workflows.
  • Compared with llama.cpp, the throughput numbers are dramatically better on prefill, but the user still saw worse quality than Q5_K_XL on their task.
  • Thermal and power costs are non-trivial: all four GPUs ran hot at 90C+, and idle draw was about 90 W per GPU.
  • The post also hints that ROCm/MoE support still has room to mature, so future driver and runtime improvements could move the ceiling higher.
// TAGS
qwen3.5-122b-a10b-gptq-int4vllmrocmllmgpuinferencebenchmarkself-hosted

DISCOVERED

70d ago

2026-03-18

PUBLISHED

70d ago

2026-03-18

RELEVANCE

8/ 10

AUTHOR

grunt_monkey_