OPEN_SOURCE
REDDIT // INFRASTRUCTURE · 29d ago
RX 580 Vulkan hits 16 t/s ceiling on llama.cpp
A LocalLLaMA user running llama.cpp with the Vulkan backend on an AMD RX 580 (Polaris, gfx803) reports a hard performance ceiling of ~16 t/s on Qwen3.5-4B Q4_K_M, despite all GPU layers being offloaded and ample VRAM headroom. The bottleneck traces back to Polaris lacking hardware matrix acceleration in RADV, which forces every matmul op through generic fp32 compute shaders.
// ANALYSIS
The RX 580 Vulkan experiment exposes a real gap between theoretical memory bandwidth (256 GB/s) and what inference actually extracts (~15% utilization), showing how critical hardware matrix ops are for LLM inference throughput.
- Polaris (gfx803) has no fp16, bf16, or int dot-product acceleration in RADV — every matrix multiply runs as a generic fp32 compute shader, which is massively inefficient for transformer attention patterns
- The gap between the theoretical ~100 t/s (bandwidth-bound) and the actual ~16 t/s is the real cost of missing tensor-core equivalents on older AMD hardware
- ROCm with HIP (building with -DGGML_HIPBLAS=ON targeting gfx803) is the realistic path forward — Vulkan lacks the low-level primitives to close this gap on Polaris
- llama.cpp's Vulkan backend is solid for supported hardware but cannot compensate for missing ISA features; no amount of flag tuning helps
- This is a useful data point for anyone evaluating old AMD GPUs for local inference — Vulkan is not a universal fallback that extracts full hardware performance
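The ~100 t/s vs. ~16 t/s gap above follows from a simple roofline-style estimate: single-stream decode is memory-bandwidth-bound, so the ceiling is bandwidth divided by bytes streamed per token. A minimal sketch, assuming a ~2.5 GB weight footprint for a 4B model at Q4_K_M (roughly 4.5–5 bits/weight — an illustrative figure, not from the post):

```python
def bandwidth_bound_tps(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tokens/s if every weight is streamed once per token."""
    return bandwidth_gbs / model_gb

# RX 580: 256 GB/s theoretical memory bandwidth.
theoretical = bandwidth_bound_tps(256.0, 2.5)
actual = 16.0  # observed ceiling reported in the post

print(f"theoretical ceiling ~ {theoretical:.0f} t/s")       # ~ 102 t/s
print(f"bandwidth utilization ~ {actual / theoretical:.0%}")  # ~ 16%
```

The ~16% result lines up with the ~15% utilization figure in the analysis; the shortfall is compute overhead, since every fp32 shader matmul stalls long before memory bandwidth becomes the limit.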
// TAGS
llama.cpp · inference · gpu · open-source · edge-ai
DISCOVERED
2026-03-14
PUBLISHED
2026-03-14
RELEVANCE
5/10
AUTHOR
Numerous_Sandwich_62