OPEN_SOURCE
REDDIT · 3h ago · BENCHMARK RESULT

llama.cpp tensor split boosts dual-GPU speed

llama.cpp’s experimental `-sm tensor` split mode is showing real multi-GPU gains on consumer hardware, pushing a dual 3090 Ti setup well ahead of single-card throughput on Qwen3.6-27B. The benchmark suggests the new split strategy is no longer just a curiosity: it can materially improve both prompt processing and token generation.

// ANALYSIS

This is the kind of “free gains” benchmark that changes hardware advice fast: once tensor split is stable enough, the best upgrade path for local LLMs may be adding a second card instead of chasing a bigger single GPU.

  • The jump from 1580/44 t/s (prompt processing / token generation) on one 3090 Ti to 2047/58 t/s on two cards is meaningful, especially for prompt-heavy workloads
  • `-sm tensor` appears to outperform older layer-splitting behavior on this setup, which matters for users trying to scale past one GPU
  • The result is from a mainstream `llama.cpp` build, so the optimization is moving from experimental patch territory toward practical default-adjacent usage
  • The gains are still workload-dependent, but the benchmark shows multi-GPU inference can now scale without needing datacenter-class hardware
  • For LocalLLaMA users, this is a strong signal that consumer dual-GPU rigs are getting better value from the software stack itself, not just from raw VRAM
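
A rough sketch of how such a comparison could be run with `llama-bench`, assuming a llama.cpp build that ships the experimental tensor split mode from the post. The model filename is illustrative; `-m`, `-ngl`, `-sm`, `-p`, and `-n` are standard `llama-bench` flags, while the `tensor` split value is taken from the post and is not available in every build:

```shell
# Baseline: default layer split across both visible GPUs
./llama-bench -m qwen3.6-27b.gguf -ngl 99 -sm layer -p 512 -n 128

# Experimental tensor split mode reported in the post
# (only present in builds that include the new code path)
./llama-bench -m qwen3.6-27b.gguf -ngl 99 -sm tensor -p 512 -n 128

# Single-GPU reference point for comparison
CUDA_VISIBLE_DEVICES=0 ./llama-bench -m qwen3.6-27b.gguf -ngl 99 -p 512 -n 128
```

`llama-bench` prints prompt-processing (pp) and token-generation (tg) throughput per configuration, which is the t/s pair quoted above.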
// TAGS
llama-cpp · llm · benchmark · gpu · inference · open-source

DISCOVERED

3h ago

2026-04-29

PUBLISHED

6h ago

2026-04-29

RELEVANCE

8/10

AUTHOR

Ok-Measurement-1575