llama.cpp users test mixed GPUs
OPEN_SOURCE
REDDIT // 6h ago · INFRASTRUCTURE


A LocalLLaMA user asks whether a 16GB RTX 4070 Ti Super and a 12GB RTX 2080-class card can be combined for llama.cpp inference across Windows, an Ubuntu VM, and Proxmox. The short answer is yes in principle, but uneven VRAM, older CUDA support, and cross-machine latency make the setup better suited to experimentation than to clean speed scaling.

// ANALYSIS

Mixed-GPU local inference is workable, but it is not the same as magically pooling VRAM into one fast card.

  • llama.cpp can split model layers across multiple GPUs, and uneven cards usually need explicit weighting via options such as --tensor-split rather than an even default split
  • Different NVIDIA generations can coexist, but the oldest card tends to constrain driver and CUDA choices
  • Splitting across separate machines or VMs pushes users toward llama.cpp RPC, where network latency can erase much of the benefit unless the link is fast
  • The practical win is fitting larger quantized GGUF models; throughput may still bottleneck on the slower GPU or PCIe/network path
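The flags discussed above can be sketched as follows. This is a minimal, illustrative invocation, assuming a recent CUDA-enabled llama.cpp build; the model filename is a placeholder, and the 16,12 ratio simply mirrors the two cards' VRAM sizes, which users often tune further by hand.

```shell
# Single-box split across uneven cards: offload all layers, split by
# layer, and weight the split toward the 16GB card over the 12GB card.
./llama-cli -m ./model.Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 16,12

# Cross-machine variant using llama.cpp's RPC backend. The host and
# port here are examples; network latency on this link is what can
# erase the benefit.
#   on the remote box:  ./rpc-server --host 0.0.0.0 --port 50052
#   on the local box:   ./llama-cli -m ./model.Q4_K_M.gguf \
#                         --rpc 192.168.1.20:50052 --n-gpu-layers 99
```

Note that --split-mode layer assigns whole layers to each GPU, so the slower card still gates the layers it owns; --split-mode row spreads individual tensors but adds synchronization overhead.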
// TAGS
llama-cpp · llm · inference · gpu · self-hosted · open-source

DISCOVERED

6h ago

2026-04-23

PUBLISHED

7h ago

2026-04-22

RELEVANCE

7 / 10

AUTHOR

smolpotat0_x