OPEN_SOURCE ↗
REDDIT // 4h ago // INFRASTRUCTURE
llama.cpp mixed-GPU setups spark tuning debate
This Reddit post asks how to get the best coding performance out of llama.cpp across a 3080, an RX 9070 XT, and an iGPU, with a particular focus on quantization quality, VRAM limits, and multi-GPU stability. The core question is whether to keep a hybrid Vulkan setup, move the 3080 into the main PC, or rely on a single stronger discrete GPU.
// ANALYSIS
The practical answer is usually “simpler beats clever” here: llama.cpp supports hybrid inference and multiple backends, but multi-GPU Vulkan setups are still flaky enough that stability and device ordering matter as much as raw VRAM.
- The thread reflects a real tradeoff for local coding models: the RX 9070 XT gives better single-card headroom than the 3080, but the 3080 can unlock higher quants if it is the only or primary inference GPU.
- llama.cpp can build with both CUDA and Vulkan, and runtime device selection exists, but mixed-brand, cross-backend multi-GPU inference is not the clean path; the safer bet is to keep one discrete GPU as the main target and treat the iGPU as a fallback only if needed.
- The user’s observation that the iGPU gets picked first is consistent with Vulkan device enumeration behavior, so explicit device selection is the key lever, not hoping the loader “knows” the faster card is preferable.
- The crash reports are a bigger signal than the tok/s numbers: once split-mode starts failing, the theoretical extra VRAM stops mattering because you lose reliability and spend time debugging instead of coding.
- For coding workloads, a stable single-GPU Q4 setup often beats a brittle split configuration that only looks better on paper.
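A minimal sketch of what "explicit device selection" looks like in practice. This assumes a recent llama.cpp build; flag names and device labels vary across versions, and the model path and `Vulkan1` index here are placeholders, so check `--help` and the enumeration output on your own machine:

```shell
# Enumerate the devices the backend actually sees; indices here,
# not marketing names, are what the loader uses for selection.
llama-server --list-devices

# Pin inference to a single discrete GPU and disable layer splitting,
# so the iGPU is never picked up. "Vulkan1" and the model filename
# are placeholders; match them to your --list-devices output.
llama-server -m model-q4_k_m.gguf \
  --device Vulkan1 \
  --split-mode none \
  -ngl 99

# Alternative for the Vulkan backend: hide unwanted devices before
# launch so only the listed index is visible to ggml at all.
export GGML_VK_VISIBLE_DEVICES=0
```

Pinning one device plus `--split-mode none` trades theoretical pooled VRAM for deterministic behavior, which is the "simpler beats clever" position the analysis above argues for.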
// TAGS
llama-cpp · llm · gpu · inference · cli · open-source · self-hosted · ai-coding
DISCOVERED
4h ago
2026-04-19
PUBLISHED
5h ago
2026-04-19
RELEVANCE
8/10
AUTHOR
PiHeich