OPEN_SOURCE
REDDIT // 6h ago // INFRASTRUCTURE
llama.cpp users test mixed GPUs
A LocalLLaMA user asks whether a 16GB RTX 4070 Ti Super and a 12GB RTX 2080-class card can be combined for llama.cpp inference across Windows, an Ubuntu VM, and Proxmox. The short answer is yes in principle, but uneven VRAM, older CUDA support, and cross-machine latency make the setup more useful for experimentation than for clean speed scaling.
// ANALYSIS
Mixed-GPU local inference is workable, but it is not the same as magically pooling VRAM into one fast card.
- llama.cpp can split model layers across multiple GPUs, and uneven cards usually need explicit weighting with options such as --tensor-split rather than relying on an even default split
- Different NVIDIA generations can coexist, but the oldest card tends to constrain which driver and CUDA versions are usable
- Splitting across separate machines or VMs pushes users toward llama.cpp's RPC backend, where network latency can erase much of the benefit unless the link is fast
- The practical win is fitting larger quantized GGUF models; throughput may still bottleneck on the slower GPU or on the PCIe/network path
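The split options above can be sketched as llama.cpp command lines. This is a minimal sketch, not the thread's exact setup: the model filename, IP address, and port are placeholders, and exact flag spellings can vary between llama.cpp builds.

```shell
# Single machine, two uneven cards: offload all layers and weight the
# split roughly by VRAM (16 GB vs 12 GB card). --tensor-split takes
# per-GPU proportions, not absolute sizes.
llama-cli -m model-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 16,12

# Across machines: start the RPC worker on the host with the remote GPU
# (hostname/port are placeholders)...
rpc-server --host 0.0.0.0 --port 50052

# ...then point the client at it; layers are then split between the
# local GPU and the remote worker over the network.
llama-cli -m model-q4_k_m.gguf --n-gpu-layers 99 --rpc 192.168.1.20:50052
```

On a slow link the RPC path often loses to simply running a smaller quant on one card, which matches the thread's conclusion that this setup buys capacity rather than speed.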
// TAGS
llama-cpp · llm · inference · gpu · self-hosted · open-source
DISCOVERED
6h ago
2026-04-23
PUBLISHED
7h ago
2026-04-22
RELEVANCE
7/10
AUTHOR
smolpotat0_x