OPEN_SOURCE
REDDIT // 4h ago · TUTORIAL
Devs pool mismatched GPUs to run 30B LLMs
A LocalLLaMA user shares a configuration for running dense 30B models by pooling the VRAM of a 16GB primary GPU and an older 6GB card with llama-server. By splitting layers asymmetrically across the two PCIe slots, the setup reaches a combined 22GB of VRAM, close to the 24GB class, for local inference without buying enterprise hardware.
// ANALYSIS
Recycling older, low-VRAM GPUs alongside modern cards offers a highly cost-effective way to clear the 24GB hurdle for local AI.
- Asymmetric VRAM splitting via llama-server's Vulkan backend enables mixing vastly different cards (e.g., a 16GB primary and a 6GB secondary); see the command sketch after this list
- Disabling memory mapping (--no-mmap) keeps the model fully resident in VRAM rather than paged from disk, preserving maximum speed
- Even with the secondary card on a slower PCIe x4 slot, generation speeds remain highly usable (19 tokens/second)
- This approach democratizes access to 30B-class models without requiring specialized multi-GPU motherboards or dual identical cards
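A minimal sketch of the kind of llama-server invocation described, assuming a Vulkan-enabled llama.cpp build; the model path and quantization are placeholders, and the 16,6 split ratio simply mirrors each card's VRAM:

  # Pool a 16GB primary GPU with an older 6GB card for a dense 30B model.
  #   -ngl 99              offload all layers to the GPUs
  #   --split-mode layer   assign whole layers per device
  #   --tensor-split 16,6  split layers ~16:6, proportional to VRAM
  #   --no-mmap            load weights into memory instead of mapping from disk
  ./llama-server -m ./models/dense-30b-Q4_K_M.gguf \
    -ngl 99 --split-mode layer --tensor-split 16,6 --no-mmap

Because layer splitting only passes a small activation tensor between cards per token, the slow x4 link has limited impact on generation speed, consistent with the reported 19 tokens/second.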
// TAGS
llama.cpp · gpu · inference · self-hosted · llm
DISCOVERED
2026-04-27 (4h ago)
PUBLISHED
2026-04-27 (5h ago)
RELEVANCE
8/10
AUTHOR
akira3weet