OPEN_SOURCE
REDDIT // 3d ago // INFRASTRUCTURE
llama.cpp splits LLMs across GPUs
The post asks whether two P106-100 6GB mining cards can be combined to run Llama 3 8B as one local model with 12GB of effective VRAM. The community answer is that multi-GPU splitting is possible, but the real limits are runtime support, interconnect overhead, and how much context you try to keep resident.
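A minimal sketch of what that splitting looks like in practice, assuming two CUDA-visible cards and a GGUF quant of Llama 3 8B (the model path, quant level, and split ratios here are illustrative, not from the thread):

```shell
# llama.cpp: layer split is the default multi-GPU mode; "--split-mode row"
# shuffles more data over PCIe per token in exchange for better balance.
# "--tensor-split 1,1" divides work evenly across the two 6GB cards.
llama-server \
  -m ./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 1,1

# vLLM equivalent: single-node tensor parallelism across two GPUs.
# Note that recent vLLM builds may not support Pascal-era compute
# capability at all, which matters for P106-100s specifically.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 2
```

Either way, each forward pass now synchronizes over PCIe, which is where the thread's "capacity first, speed second" caveat comes from.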
// ANALYSIS
This is a classic local-LLM scaling question: yes, you can shard weights across GPUs, but on cheap older cards the upgrade buys capacity first, speed second.
- `llama.cpp` supports multi-GPU splitting, and vLLM documents single-node tensor parallelism for models that do not fit on one card.
- On Pascal-era mining GPUs, PCIe communication can become the bottleneck, so two cards rarely behave like one clean 12GB GPU.
- Quantization and context size matter as much as raw weights; the KV cache can eat the headroom you thought you gained.
- If the stack is plain Transformers, `device_map="auto"` places layers across devices and runs them sequentially; it is not true tensor parallelism.
- For this use case, the smarter tradeoff is often a smaller quantized model before buying a second used GPU.
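To make the KV-cache point concrete, here is a back-of-the-envelope calculator using Llama 3 8B's published shape (32 hidden layers, 8 KV heads via grouped-query attention, head dim 128); the fp16 cache dtype is our assumption:

```python
def kv_cache_bytes(context_tokens: int,
                   layers: int = 32,      # Llama 3 8B hidden layers
                   kv_heads: int = 8,     # GQA: 8 KV heads, not 32
                   head_dim: int = 128,   # 4096 hidden / 32 attn heads
                   dtype_bytes: int = 2   # fp16/bf16 cache (assumption)
                   ) -> int:
    """Bytes of KV cache: two tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * context_tokens

# 8K of resident context already costs a full gigabyte of the
# 12GB you thought the second card bought you:
print(kv_cache_bytes(8192) / 2**30)  # 1.0 (GiB)
```

Quantizing the cache (e.g. llama.cpp's 8-bit KV option) roughly halves this, which is often a better first move than more hardware.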
// TAGS
llama-cpp · llm · inference · gpu · self-hosted · open-source
DISCOVERED
3d ago
2026-04-09
PUBLISHED
3d ago
2026-04-09
RELEVANCE
8/10
AUTHOR
HelicopterMountain47