OPEN_SOURCE · REDDIT · 29d ago · INFRASTRUCTURE

RX 580 Vulkan hits 16 t/s ceiling on llama.cpp

A LocalLLaMA user running llama.cpp with the Vulkan backend on an AMD RX 580 (Polaris, gfx803) reports a hard performance ceiling of ~16 t/s on Qwen3.5-4B Q4_K_M, despite all GPU layers being offloaded and ample VRAM headroom. The bottleneck traces back to Polaris lacking hardware matrix acceleration in RADV, which forces every matmul through generic fp32 compute shaders.

// ANALYSIS

The RX 580 Vulkan experiment exposes a real gap between theoretical memory bandwidth (256 GB/s) and actual utilization (~15%), revealing how critical hardware matrix ops are for LLM inference throughput. The sketch below works through the numbers.
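
As a quick sanity check on those figures, here is a minimal back-of-envelope sketch in Python. The ~4.5 bits/weight average for Q4_K_M and the assumption that decoding one token streams the full weight set exactly once are approximations, not numbers from the post:

    # Roofline estimate for decode throughput on the RX 580.
    MODEL_PARAMS = 4e9          # Qwen3.5-4B
    BITS_PER_WEIGHT = 4.5       # Q4_K_M average (assumption)
    BANDWIDTH_GBS = 256         # RX 580 theoretical memory bandwidth

    bytes_per_token = MODEL_PARAMS * BITS_PER_WEIGHT / 8    # ~2.25 GB read per token
    ceiling_tps = BANDWIDTH_GBS * 1e9 / bytes_per_token     # bandwidth-bound limit

    observed_tps = 16
    utilization = observed_tps * bytes_per_token / (BANDWIDTH_GBS * 1e9)

    print(f"bandwidth-bound ceiling: {ceiling_tps:.0f} t/s")  # ~114 t/s
    print(f"effective bandwidth use: {utilization:.0%}")      # ~14%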

  • Polaris (gfx803) has no fp16, bf16, or integer dot-product acceleration in RADV, so every matrix multiply runs as a generic fp32 compute shader, which is massively inefficient for transformer workloads (see the compute-side sketch after this list)
  • The gap between the theoretical ~100 t/s (bandwidth-bound) and the observed ~16 t/s is the real cost of missing tensor-core equivalents on older AMD hardware
  • ROCm with HIP (building llama.cpp with -DGGML_HIPBLAS=ON and targeting gfx803) is the realistic path forward; Vulkan lacks the low-level primitives to close this gap on Polaris
  • llama.cpp's Vulkan backend is solid on supported hardware, but it cannot compensate for missing ISA features, and no amount of flag tuning helps
  • This is a useful data point for anyone evaluating old AMD GPUs for local inference: Vulkan is not a universal fallback that extracts full hardware performance
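
The compute side tells the same story. A rough check under stated assumptions (~6.2 TFLOPS reference fp32 throughput for the RX 580, ~2 FLOPs per parameter per decoded token for a dense model; neither figure is from the post) shows raw ALU throughput is nominally far above what is observed, so the shortfall is shader efficiency, not FLOPs:

    # Naive fp32 compute ceiling (assumed spec values, not from the post).
    FP32_TFLOPS = 6.2               # RX 580 reference card, approximate
    FLOPS_PER_TOKEN = 2 * 4e9       # ~8 GFLOPs per decoded token, dense 4B model

    compute_ceiling_tps = FP32_TFLOPS * 1e12 / FLOPS_PER_TOKEN
    print(f"naive fp32 compute ceiling: {compute_ceiling_tps:.0f} t/s")  # ~775 t/s
    # The observed ~16 t/s sits far below both ceilings: generic fp32
    # matmul shaders reach only a small fraction of either limit.
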
// TAGS
llama.cpp · inference · gpu · open-source · edge-ai

DISCOVERED

2026-03-14 (29d ago)

PUBLISHED

2026-03-14 (29d ago)

RELEVANCE

5/10

AUTHOR

Numerous_Sandwich_62