OPEN_SOURCE
REDDIT // 6h ago // INFRASTRUCTURE
llama.cpp users test mixed GPUs
A LocalLLaMA user asks whether a 16GB RTX 4070 Ti Super and a 12GB RTX 2080-class card can be combined for llama.cpp inference across Windows, an Ubuntu VM, and Proxmox. The short answer is yes in principle, but uneven VRAM, older CUDA support, and cross-machine latency make the setup more useful for experimentation than for clean speed scaling.
// ANALYSIS
Mixed-GPU local inference is workable, but it is not the same as magically pooling VRAM into one fast card.
- llama.cpp can split model layers across multiple GPUs, and uneven cards usually need explicit weighting with options such as --tensor-split rather than relying on an even default split
- Different NVIDIA generations can coexist, but the oldest card tends to constrain which driver and CUDA versions are usable
- Splitting across separate machines or VMs pushes users toward llama.cpp's RPC backend, where network latency can erase much of the benefit unless the link is fast
- The practical win is fitting larger quantized GGUF models; throughput may still bottleneck on the slower GPU or on the PCIe/network path
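The split options above can be sketched as llama.cpp command lines. This is a minimal sketch, not the thread's exact setup: the model filename, IP address, and port are placeholders, and exact flag spellings can vary between llama.cpp builds.

```shell
# Single machine, two uneven cards: offload all layers and weight the
# split roughly by VRAM (16 GB vs 12 GB card). --tensor-split takes
# per-GPU proportions, not absolute sizes.
llama-cli -m model-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 16,12

# Across machines: start the RPC worker on the host with the remote GPU
# (hostname/port are placeholders)...
rpc-server --host 0.0.0.0 --port 50052

# ...then point the client at it; layers are then split between the
# local GPU and the remote worker over the network.
llama-cli -m model-q4_k_m.gguf --n-gpu-layers 99 --rpc 192.168.1.20:50052
```

On a slow link the RPC path often loses to simply running a smaller quant on one card, which matches the thread's conclusion that this setup buys capacity rather than speed.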
// TAGS
llama-cpp · llm · inference · gpu · self-hosted · open-source
DISCOVERED
6h ago
2026-04-23
PUBLISHED
7h ago
2026-04-22
RELEVANCE
7/10
AUTHOR
smolpotat0_x