Ollama splits local LLMs across GPUs
OPEN_SOURCE · REDDIT · INFRASTRUCTURE // 1d ago

Redditors report that mixed-GPU local inference can work, especially with Ollama and llama.cpp, but the extra card usually adds capacity rather than speed. If the model fits on the 4070 alone, that single card remains the faster path; if not, the runtime can spread the layers across both GPUs.
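
For readers who want to control the split rather than rely on automatic placement, llama.cpp exposes it directly. The sketch below launches llama-cli with a hand-weighted tensor split; the flags (--n-gpu-layers, --tensor-split, --main-gpu) are real llama.cpp options, while the binary path, model path, and the 3:1 ratio favoring the 4070 are assumptions for illustration, not values from the thread.

    import subprocess

    # Sketch: weight the layer split toward the faster card. Assumes the
    # 4070 enumerates as CUDA device 0; verify with nvidia-smi before use.
    subprocess.run([
        "./llama-cli",              # llama.cpp CLI binary (path assumed)
        "-m", "models/model.gguf",  # GGUF model path (assumed)
        "--n-gpu-layers", "999",    # offload every layer that fits
        "--tensor-split", "3,1",    # ~3/4 of layers to GPU 0, ~1/4 to GPU 1
        "--main-gpu", "0",          # keep the primary device on the 4070
        "-p", "Hello",              # trivial prompt just to exercise the run
    ], check=True)

Weighting the split toward the 4070 limits how much of each token's work waits on the slower Pascal card, which is the point of the analysis below.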

// ANALYSIS

Worth doing if your goal is to fit a larger model or run separate models in parallel, not if you expect a clean pooled-VRAM upgrade. The 1070 will add usable memory, but PCIe traffic and the older card’s slower compute will cap throughput.

  • Ollama’s docs say it will load on a single GPU when the model fits there, and only spread across GPUs when it must.
  • Community replies in the thread report that mixed NVIDIA setups work with Ollama and llama.cpp.
  • The 1070’s Pascal-era hardware means it is likely to be the bottleneck on generation speed.
  • Best case: use the 1070 for a second model or auxiliary workload instead of forcing one model to crawl across both cards; a sketch of that setup follows this list.
  • For routine local LLM use, a single larger-VRAM GPU is still the cleaner setup.
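
A minimal sketch of that second-model setup, assuming the 1070 enumerates as CUDA device 1: CUDA_VISIBLE_DEVICES is standard NVIDIA runtime behavior that Ollama honors, and OLLAMA_HOST moves the second instance off the default 11434 port. The device index and port choice are illustrative assumptions, not details from the thread.

    import os
    import subprocess

    # Sketch: start a second Ollama server that can only see the 1070,
    # leaving the default instance (and the 4070) untouched.
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "1"       # hide the 4070 from this instance
    env["OLLAMA_HOST"] = "127.0.0.1:11435"  # avoid the default 11434 port
    subprocess.run(["ollama", "serve"], env=env, check=True)

Clients that point at 127.0.0.1:11435 then get a model served entirely from the 1070, while the default instance keeps the 4070 to itself.
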
// TAGS
llm · gpu · inference · local-first · ollama

DISCOVERED

2026-05-02 (1d ago)

PUBLISHED

2026-05-02 (1d ago)

RELEVANCE

7/10

AUTHOR

ShadowBannedAugustus