OPEN_SOURCE
REDDIT // 3h ago · BENCHMARK RESULT
llama.cpp tensor split boosts dual-GPU speed
llama.cpp’s experimental `-sm tensor` split mode is showing real multi-GPU gains on consumer hardware, pushing a dual 3090 Ti setup well ahead of single-card throughput on Qwen3.6-27B. The benchmark suggests the new split strategy is no longer just a curiosity; it can materially improve both prompt processing and token generation.
// ANALYSIS
This is the kind of “free gains” benchmark that changes hardware advice fast: once tensor split is stable enough, the best upgrade path for local LLMs may be adding a second card instead of chasing a bigger single GPU.
- The jump from 1580/44 t/s on one 3090 Ti to 2047/58 t/s on two cards is meaningful, especially for prompt-heavy workloads
- `-sm tensor` appears to outperform older layer-splitting behavior on this setup, which matters for users trying to scale past one GPU
- The result is from a mainstream `llama.cpp` build, so the optimization is moving from experimental patch territory toward practical default-adjacent usage
- The gains are still workload-dependent, but the benchmark shows multi-GPU inference can now scale without needing datacenter-class hardware
- For LocalLLaMA users, this is a strong signal that consumer dual-GPU rigs are getting better value from the software stack itself, not just from raw VRAM
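To put the reported numbers in perspective, a quick sketch of the relative speedup they imply (the 1580/44 and 2047/58 t/s figures are from the post; "pp" and "tg" follow llama-bench's prompt-processing / token-generation convention):

```python
# Relative gains implied by the benchmark numbers cited above (tokens/sec).
# Values are taken from the post; this only computes the percentage speedup.
single = {"pp": 1580, "tg": 44}   # one 3090 Ti
dual   = {"pp": 2047, "tg": 58}   # two 3090 Ti with `-sm tensor`

for phase in ("pp", "tg"):
    gain = dual[phase] / single[phase] - 1
    print(f"{phase}: {gain:.0%} faster")
```

Both phases land near +30%, which is why the result reads as a genuine scaling win rather than a prompt-only optimization.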
// TAGS
llama-cpp · llm · benchmark · gpu · inference · open-source
DISCOVERED
3h ago
2026-04-29
PUBLISHED
6h ago
2026-04-29
RELEVANCE
8/10
AUTHOR
Ok-Measurement-1575