RX 580 Vulkan hits 16 t/s ceiling on llama.cpp
A LocalLLaMA user running llama.cpp with the Vulkan backend on an AMD RX 580 (Polaris, gfx803) reports a hard performance ceiling of ~16 tokens per second (t/s) on Qwen3.5-4B Q4_K_M, even with all layers offloaded to the GPU and ample VRAM headroom. The bottleneck is attributed to Polaris lacking hardware matrix acceleration under RADV, which forces every matmul op through generic fp32 shaders.
The RX 580 Vulkan experiment exposes a gap between the card's theoretical memory bandwidth (256 GB/s) and the roughly 15% of it that inference actually uses, which shows how much LLM inference throughput depends on hardware matrix ops.
- Polaris (gfx803) has no fp16, bf16, or integer dot-product acceleration in RADV, so every matrix multiply runs as a generic fp32 compute shader, which is very inefficient for the matmul-heavy transformer workload
- The gap between the theoretical ~100 t/s (bandwidth-bound) and the actual ~16 t/s is the real cost of missing tensor-core equivalents on older AMD hardware
- ROCm with HIP (-DGGML_HIPBLAS=ON, targeting gfx803) is the realistic path forward, because Vulkan lacks the low-level primitives to close this gap on Polaris
- llama.cpp's Vulkan backend is solid on supported hardware but cannot compensate for missing ISA features; no amount of flag tuning helps
- This is a useful data point for anyone evaluating old AMD GPUs for local inference: Vulkan is not a universal fallback that extracts full hardware performance
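
The bandwidth figures above can be cross-checked with a short back-of-envelope sketch. It assumes decode is purely bandwidth-bound, meaning every generated token reads the full set of weights from VRAM once; the model size is not stated in the post, so it is inferred here from the quoted 256 GB/s and ~100 t/s numbers. The function names are illustrative and not part of llama.cpp.

```python
# Back-of-envelope check of the bandwidth-bound numbers quoted above.
# Assumption: each generated token streams all model weights from VRAM once,
# so tokens/s is capped at (memory bandwidth) / (weight bytes per token).

PEAK_BANDWIDTH_GBPS = 256.0  # theoretical RX 580 bandwidth, from the post
CEILING_TPS = 100.0          # bandwidth-bound estimate quoted in the post
MEASURED_TPS = 16.0          # throughput the user reports


def implied_weights_gb(bandwidth_gbps: float, ceiling_tps: float) -> float:
    """Weight size (GB) implied by a bandwidth-bound tokens/s ceiling."""
    return bandwidth_gbps / ceiling_tps


def effective_bandwidth_gbps(weights_gb: float, tps: float) -> float:
    """Bandwidth actually consumed at a given tokens/s rate."""
    return weights_gb * tps


def utilization(weights_gb: float, tps: float, bandwidth_gbps: float) -> float:
    """Fraction of peak bandwidth used at a given tokens/s rate."""
    return effective_bandwidth_gbps(weights_gb, tps) / bandwidth_gbps


if __name__ == "__main__":
    weights = implied_weights_gb(PEAK_BANDWIDTH_GBPS, CEILING_TPS)
    used = effective_bandwidth_gbps(weights, MEASURED_TPS)
    frac = utilization(weights, MEASURED_TPS, PEAK_BANDWIDTH_GBPS)
    print(f"implied weights per token: {weights:.2f} GB")
    print(f"effective bandwidth at {MEASURED_TPS:.0f} t/s: {used:.1f} GB/s")
    print(f"utilization of peak: {frac:.0%}")
```

Under these assumptions the measured rate corresponds to about 16% of peak bandwidth, in line with the ~15% utilization cited. Real decode also reads the KV cache and activations, so this is an upper-bound estimate rather than a precise model.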
DISCOVERED
2026-03-14
PUBLISHED
2026-03-14
AUTHOR
Numerous_Sandwich_62
