Qwen3.6-27B tests 3-GPU speed ceiling
OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT


A Reddit user reports 18-20 t/s generation and about 650 t/s prompt processing on Q8 quants across three Radeon 7900 XTX GPUs in llama.cpp. The post asks whether those numbers are normal and what tuning tricks actually move the needle in multi-GPU AMD setups.

// ANALYSIS

The numbers do not look wildly off for a dense 27B model on consumer AMD hardware, but they do suggest the decode path is the bottleneck, not raw VRAM.
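A back-of-envelope bandwidth check supports that read. The figures below are assumptions, not from the post: ~8.5 effective bits per weight for Q8_0 (8-bit weights plus block scales) and ~960 GB/s theoretical memory bandwidth per 7900 XTX.

```python
# Sanity check: is 18-20 t/s decode plausible for a dense 27B at Q8?
PARAMS = 27e9
BITS_PER_WEIGHT = 8.5    # assumed: Q8_0 stores 8-bit weights plus scales
BANDWIDTH_GBS = 960      # assumed: 7900 XTX theoretical peak, GB/s

model_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# With llama.cpp's default per-layer split, each decoded token still streams
# the full weight set once (the GPUs work on their layers sequentially), so
# the single-card bandwidth ceiling applies regardless of how many cards:
ceiling_tps = BANDWIDTH_GBS / model_gb

print(f"model ≈ {model_gb:.1f} GB, decode ceiling ≈ {ceiling_tps:.0f} t/s")
```

That puts the theoretical ceiling in the low 30s of t/s, so 18-20 t/s is roughly 55-60% of peak, a believable efficiency once split and synchronization overhead are counted.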

  • 27B Q8 across 3x 7900 XTX is already a high-friction inference setup, so scaling gains will come from tuning more than from simply adding cards
  • 650 t/s prompt processing is decent; the gap is that decode speed often flattens out because of split overhead, synchronization, and KV-cache behavior
  • This is a useful real-world datapoint because AMD multi-GPU llama.cpp performance is discussed far less often than CUDA/NVIDIA setups
  • The most relevant knobs are likely tensor split strategy, batch sizes, context settings, and build/driver versions rather than the model alone
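As a sketch of where those knobs live on llama.cpp's command line. The model filename and all flag values here are illustrative starting points to sweep, not known-good settings for this setup:

```shell
# -ngl 99         offload all layers to the GPUs
# --split-mode    "layer" (default) vs "row"; worth comparing for decode speed
# --tensor-split  proportion of layers/VRAM assigned to each GPU
# -b / -ub        logical / physical batch size (mainly affects prompt t/s)
# -c              context size; a larger KV cache costs bandwidth at decode
./llama-cli -m qwen3.6-27b-q8_0.gguf -ngl 99 --split-mode layer \
  --tensor-split 1,1,1 -b 2048 -ub 512 -c 8192
```

Build and driver versions matter too: results on ROCm vs Vulkan backends can differ enough that re-running the same sweep after an update is worthwhile.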
// TAGS
qwen3 · llama-cpp · ai-coding · inference · gpu · self-hosted · open-weights · benchmark

DISCOVERED

3h ago

2026-04-25

PUBLISHED

7h ago

2026-04-24

RELEVANCE

8/10

AUTHOR

SemaMod