llama.cpp splits LLMs across GPUs
OPEN_SOURCE
REDDIT · 3d ago · INFRASTRUCTURE


The post asks whether two P106-100 6GB mining cards can be combined to run Llama 3 8B as one local model with 12GB of effective VRAM. The community answer is that multi-GPU splitting is possible, but the real limits are runtime support, interconnect overhead, and how much context you try to keep resident.
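As a sketch of what "splitting" means in practice: llama.cpp's Python binding, `llama-cpp-python`, exposes `n_gpu_layers` and `tensor_split` parameters that control how much of the model is offloaded and how it is divided across devices. The parameter names below match that binding; the helper function itself is hypothetical, just assembling the arguments for an even two-card split.

```python
def split_kwargs(n_gpus: int, ctx: int = 4096) -> dict:
    """Build llama-cpp-python constructor kwargs for an even split
    across n_gpus devices (hypothetical helper, not part of the library)."""
    return {
        "n_gpu_layers": -1,                        # offload every layer to GPU
        "tensor_split": [1.0 / n_gpus] * n_gpus,   # even VRAM share per card
        "n_ctx": ctx,                              # context length; KV cache grows with this
    }

# Usage (requires llama-cpp-python built with CUDA and a GGUF model on disk):
# from llama_cpp import Llama
# llm = Llama(model_path="llama-3-8b.Q4_K_M.gguf", **split_kwargs(2))
```

Note that `tensor_split` only divides where weights live; every forward pass still shuffles activations over PCIe, which is the overhead the thread warns about.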

// ANALYSIS

This is a classic local-LLM scaling question: yes, you can shard weights across GPUs, but on cheap older cards the upgrade buys capacity first, speed second.

  • `llama.cpp` supports multi-GPU splitting, and vLLM documents single-node tensor parallelism for models that do not fit on one card.
  • On Pascal-era mining GPUs, PCIe communication can become the bottleneck, so two cards rarely behave like one clean 12GB GPU.
  • Quantization and context size matter as much as raw weights; KV cache can eat the headroom you thought you gained.
  • If the stack is plain Transformers, `device_map="auto"` is not true tensor parallelism: it places whole layers on different devices, so the GPUs run one after another rather than in parallel.
  • For this use case, the smarter tradeoff is often a smaller quantized model before buying a second used GPU.
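To make the KV-cache point concrete, here is a back-of-envelope VRAM budget, assuming Llama 3 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128), an fp16 KV cache, and roughly 4.5 bits per weight for a Q4_K_M-style GGUF; the weight figure is an approximation, not a measurement.

```python
# Back-of-envelope VRAM budget for Llama 3 8B on 2 x 6 GB cards.
# Architecture constants come from the published model config;
# byte counts are estimates.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # GQA: 8 KV heads, not 32
FP16 = 2                                   # bytes per element

def kv_cache_bytes(tokens: int) -> int:
    """fp16 KV cache size for a given resident context length."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16 * tokens  # 2 = K and V

def weight_bytes(params: float = 8e9, bits_per_weight: float = 4.5) -> int:
    """Rough GGUF weight size at ~4.5 bits/weight (Q4_K_M-ish)."""
    return int(params * bits_per_weight / 8)

gib = 1024 ** 3
print(f"KV cache @ 8k ctx: {kv_cache_bytes(8192) / gib:.2f} GiB")  # ~1.0 GiB
print(f"Quantized weights: {weight_bytes() / gib:.2f} GiB")        # ~4.2 GiB
```

Roughly 4.2 GiB of weights plus 1 GiB of KV cache at 8k context, before compute buffers, already crowds a single 6 GB card, which is why the thread's advice comes down to either splitting across two cards or shrinking the quant and context first.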
// TAGS
llama-cpp · llm · inference · gpu · self-hosted · open-source

DISCOVERED

3d ago

2026-04-09

PUBLISHED

3d ago

2026-04-09

RELEVANCE

8/10

AUTHOR

HelicopterMountain47