OPEN_SOURCE
REDDIT · 20d ago · BENCHMARK RESULT
llama.cpp Mi50 benchmarks pit ROCm against Vulkan
On a 32GB Mi50, llama.cpp's TheRock ROCm 7.13 nightly and its Vulkan backend split the wins by workload: Vulkan is quicker for short-context dense-model chat, while ROCm takes over once context stretches or MoE and CPU-offload paths enter the mix.
// ANALYSIS
This looks less like a universal backend verdict than a reminder that local inference performance is shaped by context length and model topology.
- For dense models, Vulkan wins the interactive end: Qwen 3.5 9B prompt processing hits 871.17 t/s on Vulkan vs 708.58 on ROCm at 512 tokens, and the 27B runs at 252.68 vs 209.06.
- The crossover shows up fast: at 32k context, 9B prompt processing flips to 593.8 t/s on ROCm vs 447.76 on Vulkan, and the 27B flips to 176.69 vs 128.72.
- Generation stays more Vulkan-friendly on dense models, so the ROCm win comes mostly from faster prompt processing amortized over the whole session rather than raw token-by-token speed.
- The 122B run with 28 layers offloaded to CPU is where ROCm really earns its keep: at 32k, generation runs at 24.65 t/s on ROCm vs 18.41 on Vulkan, and prompt processing at 153.16 vs 113.16.
- Nightly ROCm is still a risk tradeoff: the reported llama-server prompt-cache OOM and earlier leak-like behavior make these results useful but not production-safe.
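The amortization point in the bullets can be sanity-checked with quick arithmetic. In the sketch below, the prompt-processing rates are the 27B figures at 32k context from the post; the generation rates and the 32k-prompt/512-token session shape are hypothetical, chosen so that Vulkan deliberately leads on raw generation speed:

```python
# Rough session-time model: time = prompt_tokens / pp_rate + gen_tokens / tg_rate.

def session_seconds(prompt_tokens: int, gen_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Wall time to ingest the prompt plus generate the reply."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

PROMPT, GEN = 32_000, 512  # hypothetical long-context chat turn

# pp rates from the post (27B dense at 32k); tg rates are assumed,
# with Vulkan faster at generation to match the bullet's premise.
rocm   = session_seconds(PROMPT, GEN, pp_tps=176.69, tg_tps=10.0)
vulkan = session_seconds(PROMPT, GEN, pp_tps=128.72, tg_tps=11.0)

print(f"ROCm:   {rocm:6.1f} s")    # prompt processing dominates the total
print(f"Vulkan: {vulkan:6.1f} s")  # slower overall despite faster generation
```

At these assumed rates, the faster prompt side buys back far more wall time than the slower generation side costs, which is exactly the crossover dynamic the 32k numbers describe.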
// TAGS
llama-cpp · llm · gpu · inference · benchmark · open-source
DISCOVERED
2026-03-22
PUBLISHED
2026-03-22
RELEVANCE
8/10
AUTHOR
JaredsBored