OPEN_SOURCE
REDDIT // 4h ago · TUTORIAL
Devs pool mismatched GPUs to run 30B LLMs
A LocalLLaMA user shares a configuration for running dense 30B models by pooling the VRAM of a 16GB primary GPU and an older 6GB card with llama-server. By splitting layers asymmetrically across the two PCIe slots, the setup reaches a combined 22GB of VRAM, close to the 24GB class, for local inference without buying enterprise hardware.
// ANALYSIS
Recycling older, low-VRAM GPUs alongside modern cards offers a highly cost-effective way to clear the 24GB hurdle for local AI.
- Asymmetric VRAM splitting via llama-server's Vulkan backend enables mixing vastly different cards (e.g., a 16GB primary and a 6GB secondary); see the command sketch after this list
- Disabling memory mapping (--no-mmap) keeps the model fully resident in VRAM rather than paged from disk, preserving maximum speed
- Even with the secondary card on a slower PCIe x4 slot, generation speeds remain highly usable (19 tokens/second)
- This approach democratizes access to 30B-class models without requiring specialized multi-GPU motherboards or dual identical cards
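A minimal sketch of the kind of llama-server invocation described, assuming a Vulkan-enabled llama.cpp build; the model path and quantization are placeholders, and the 16,6 split ratio simply mirrors each card's VRAM:

  # Pool a 16GB primary GPU with an older 6GB card for a dense 30B model.
  #   -ngl 99              offload all layers to the GPUs
  #   --split-mode layer   assign whole layers per device
  #   --tensor-split 16,6  split layers ~16:6, proportional to VRAM
  #   --no-mmap            load weights into memory instead of mapping from disk
  ./llama-server -m ./models/dense-30b-Q4_K_M.gguf \
    -ngl 99 --split-mode layer --tensor-split 16,6 --no-mmap

Because layer splitting only passes a small activation tensor between cards per token, the slow x4 link has limited impact on generation speed, consistent with the reported 19 tokens/second.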
// TAGS
llama.cpp · gpu · inference · self-hosted · llm
DISCOVERED
2026-04-27 (4h ago)
PUBLISHED
2026-04-27 (5h ago)
RELEVANCE
8/10
AUTHOR
akira3weet