OPEN_SOURCE
REDDIT · 3h ago · BENCHMARK RESULT
Qwen3.6-27B tests 3-GPU speed ceiling
A Reddit user reports 18-20 t/s generation and about 650 t/s prompt processing on Q8 quants across three Radeon 7900 XTX GPUs in llama.cpp. The post asks whether those numbers are normal and what tuning tricks actually move the needle in multi-GPU AMD setups.
// ANALYSIS
The numbers do not look wildly off for a dense 27B model on consumer AMD hardware, but they do suggest the decode path, not raw VRAM capacity, is the bottleneck.
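That reading can be sanity-checked with back-of-envelope arithmetic: decode is memory-bandwidth-bound, so the ceiling is roughly bandwidth divided by the bytes of weights read per token. The figures below are assumptions for illustration (roughly 1 byte/param at Q8, roughly 960 GB/s peak bandwidth for a 7900 XTX), not numbers from the post.

```python
# Back-of-envelope decode ceiling for a dense 27B model at Q8.
# Assumed (not from the post): ~1 byte/param at Q8 (ignoring quant
# scales and KV cache), ~960 GB/s peak bandwidth per 7900 XTX.
params = 27e9              # 27B parameters
bytes_per_param = 1.0      # Q8 ~ 1 byte per weight
weight_bytes = params * bytes_per_param
bandwidth = 960e9          # bytes/s, single card

# With llama.cpp's default layer split, each token still traverses
# every layer in sequence, so bandwidth does NOT sum across cards:
# the single-card ceiling is the relevant one.
ceiling_tps = bandwidth / weight_bytes
print(f"ideal decode ceiling: {ceiling_tps:.1f} t/s")  # ~35.6 t/s

# Reported 18-20 t/s is roughly half that ceiling, which is plausible
# once inter-GPU synchronization and launch overhead are counted.
print(f"efficiency at 19 t/s: {19 / ceiling_tps:.0%}")
```

Under these assumptions, the reported numbers sit at about 50-55% of the single-card bandwidth ceiling, which is why tuning the split and sync behavior matters more than adding VRAM.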
- 27B Q8 across 3x 7900 XTX is already a high-friction inference setup, so scaling gains will come from tuning more than from simply adding cards
- 650 t/s prompt processing is decent; the gap is that decode speed often flattens out because of split overhead, synchronization, and KV-cache behavior
- This is a useful real-world datapoint, since AMD multi-GPU llama.cpp performance is discussed far less often than CUDA/NVIDIA setups
- The most relevant knobs are likely tensor-split strategy, batch sizes, context settings, and build/driver versions rather than the model alone
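The knobs above map to llama.cpp CLI flags roughly as follows. This is a hedged sketch, not a known-good configuration: flag spellings match recent llama.cpp builds but vary by version, and the model path and split ratios are placeholders.

```shell
# Sketch of the multi-GPU tuning knobs as llama.cpp flags (placeholders,
# not the poster's actual command; verify against your build's --help).
#
# --split-mode layer keeps whole layers on each GPU (less sync traffic);
#   "row" splits individual matmuls across cards and is worth A/B testing.
# --tensor-split sets the per-GPU share; uneven ratios can help if one
#   card also drives a display.
# --batch-size / --ubatch-size mainly move prompt-processing throughput.
# --flash-attn reduces KV-cache bandwidth pressure where the ROCm/HIP
#   backend supports it (on some builds this flag takes on/off/auto).
./llama-cli -m ./qwen-27b-q8_0.gguf \
    --n-gpu-layers 99 \
    --split-mode layer \
    --tensor-split 1,1,1 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --ctx-size 8192 \
    --flash-attn
```

A/B testing `--split-mode layer` against `row` with `llama-bench` is usually the fastest way to see which side of the sync-versus-bandwidth tradeoff a given driver build lands on.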
// TAGS
qwen3llama-cppai-codinginferencegpuself-hostedopen-weightsbenchmark
DISCOVERED
3h ago
2026-04-25
PUBLISHED
7h ago
2026-04-24
RELEVANCE
8 / 10
AUTHOR
SemaMod