llama.cpp users weigh 24GB Radeon split
OPEN_SOURCE · REDDIT · 2h ago · INFRASTRUCTURE


A LocalLLaMA thread asks whether stepping up from a 16GB to a 24GB AMD dGPU over OCuLink is worth it for llama.cpp Vulkan inference, especially for daily Qwen 32B use and eventual 70B experiments. The other open question is whether an all-AMD Vulkan setup, a 780M iGPU plus the dGPU, behaves cleanly under tensor split.

// ANALYSIS

The short answer is that 24GB buys real headroom, but it does not magically make 70B easy; it mostly shifts you from "careful fitting" to "more comfortable fitting" for 32B-class models.
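To make the fitting arithmetic concrete, here is a back-of-envelope sketch. The shape numbers for Qwen2.5-32B (64 layers, 8 KV heads via GQA, head dim 128, ~32.8B parameters) and the ~4.85 bits/weight figure for a Q4_K_M-class quant are assumptions for illustration, not measurements from the thread:

```python
def model_vram_gib(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a quantized model."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB: K and V tensors per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

# Assumed Qwen2.5-32B-like shape at a ~Q4_K_M bit rate.
weights = model_vram_gib(32.8, 4.85)      # ~18.5 GiB for weights alone
kv = kv_cache_gib(64, 8, 128, ctx=8192)   # 2.0 GiB of fp16 KV cache at 8k context
print(f"weights ≈ {weights:.1f} GiB, KV ≈ {kv:.1f} GiB")
```

Under these assumptions the weights alone land near 18–19 GiB, so a 16GB card forces CPU offload before any context is allocated, while 24GB leaves room for the KV cache and compute buffers; that is the "careful fitting" versus "comfortable fitting" difference in numbers.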

  • llama.cpp’s own README confirms Vulkan backend support and CPU+GPU hybrid inference, so the basic 780M + dGPU architecture is aligned with the project’s design.
  • GitHub threads show Vulkan device enumeration can distinguish multiple adapters cleanly, and `GGML_VK_VISIBLE_DEVICES` can force device selection, which is the key piece for an all-AMD split setup.
  • The risk is not device detection, it’s behavior under multi-GPU Vulkan: there are open and recent bug reports about tensor-split regressions, OOMs, and slowdowns on split workloads.
  • For a 32B daily driver, 24GB is the safer buy if budget allows; for 70B, the limiting factors quickly become quantization, context size, and CPU offload rather than just raw VRAM totals.
  • In practice, this is a "benchmark your exact model/quant" purchase, not a spec-sheet purchase, because Vulkan split performance can vary sharply by backend version and split mode.
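For the benchmarking the last point recommends, a hedged starting point using llama.cpp's documented flags (`--tensor-split`, `--split-mode`, `-ngl`) and the `GGML_VK_VISIBLE_DEVICES` variable mentioned above. The device indices and the model filename are illustrative; the actual order depends on how Vulkan enumerates the 780M and the dGPU on a given system, so check the device list llama.cpp prints at startup:

```shell
# List the Vulkan adapters the system exposes (indices vary per machine).
vulkaninfo --summary

# Pin inference to a single device, assumed here to be the dGPU at index 0.
GGML_VK_VISIBLE_DEVICES=0 llama-server -m qwen2.5-32b-q4_k_m.gguf -ngl 99

# Split across dGPU + iGPU by layer, weighted 3:1 toward the assumed dGPU.
# Benchmark this against the single-device run; split mode and backend
# version can change results sharply, per the open bug reports.
GGML_VK_VISIBLE_DEVICES=0,1 llama-server -m qwen2.5-32b-q4_k_m.gguf \
  -ngl 99 --split-mode layer --tensor-split 3,1
```

Comparing tokens/sec between the pinned-dGPU run and the split run on the exact model and quant you plan to use is the cheapest way to find out whether the iGPU helps or hurts.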
// TAGS
llama-cpp · llm · gpu · inference · open-source · self-hosted · cli

DISCOVERED

2h ago (2026-04-19)

PUBLISHED

4h ago (2026-04-19)

RELEVANCE

8/10

AUTHOR

Pablo_Gates