llama.cpp multi-GPU bench needs row split
A LocalLLaMA user hit a `llama-bench` quirk on a DGX-1-style setup: `--device CUDA0,CUDA1` caused the benchmark to sweep GPUs one at a time instead of tensor-splitting a Qwen3.6-27B run across both cards. The thread points toward `-sm row` for tensor parallelism, with `CUDA_VISIBLE_DEVICES` and `-ts 1,1` as the cleaner way to control which GPUs participate.
The important bit is that `--device` is a visibility filter, not a magic “parallelize these two GPUs” switch, and `-sm row` tensor splitting is still experimental enough that the default behavior can surprise you.
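A minimal sketch of the two invocation styles, assuming a two-GPU box and a placeholder GGUF path (the depths mirror the thread’s `-d` sweep, and the behavioral notes follow the thread’s description):

```bash
# As reported in the thread, comma-separated --device values are swept
# one configuration at a time, so this benchmarks each GPU on its own:
llama-bench -m ./model.gguf --device CUDA0,CUDA1 -d 4096,16384,65536

# To actually split the model across both cards, expose both GPUs and
# request the row split mode (tensor parallelism) explicitly:
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m ./model.gguf -sm row -d 4096,16384,65536
```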
- `llama-bench` runs combinations of test parameters, so the user’s `-d 4096,16384,65536` sweep was expected to iterate; the real mistake was assuming `--device CUDA0,CUDA1` would combine the GPUs into one shared benchmark run
- Maintainer guidance in llama.cpp says `-sm layer` is mostly pipeline parallelism for larger prompts, while `-sm row` is the tensor-parallel mode if you want work spread across GPUs
- On short or moderate contexts, multi-GPU overhead can erase the gain, so “two cards” does not automatically beat one fast card
- For repeatable benchmarking, `CUDA_VISIBLE_DEVICES=0,1` plus explicit split settings is usually less ambiguous than trying to encode topology directly in `--device` (a sketch follows this list)
- This is the kind of llama.cpp tuning problem that matters for real inference deployments, not just synthetic benchmarks
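For the “repeatable benchmark” angle, a hedged sketch under the same placeholder model path: it assumes llama-bench’s comma-separated sweep applies to `-sm` the same way the thread shows it does for `-d`, so one run produces layer- and row-split numbers side by side; `-r` sets the repetition count per test.

```bash
# Both GPUs exposed via CUDA_VISIBLE_DEVICES instead of --device;
# -sm layer,row sweeps both split modes, -d sweeps the prompt depths,
# and -r 3 averages three repetitions per configuration.
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m ./model.gguf \
  -sm layer,row -d 4096,16384,65536 -r 3
```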