llama.cpp multi-GPU bench needs row split
A LocalLLaMA user hit a `llama-bench` quirk on a DGX-1-style setup: `--device CUDA0,CUDA1` caused the benchmark to sweep GPUs one at a time instead of tensor-splitting a Qwen3.6-27B run across both cards. The thread points toward `-sm row` for tensor parallelism, with `CUDA_VISIBLE_DEVICES` and `-ts 1,1` as the cleaner way to control which GPUs participate.
The important bit is that `--device` is a visibility filter, not a magic “parallelize these two GPUs” switch, and `-sm row` tensor splitting is still experimental enough that the default behavior can surprise you.
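A minimal sketch of the two invocation styles, assuming a two-GPU box and a placeholder GGUF path (the depths mirror the thread’s `-d` sweep, and the behavioral notes follow the thread’s description):

```bash
# As reported in the thread, comma-separated --device values are swept
# one configuration at a time, so this benchmarks each GPU on its own:
llama-bench -m ./model.gguf --device CUDA0,CUDA1 -d 4096,16384,65536

# To actually split the model across both cards, expose both GPUs and
# request the row split mode (tensor parallelism) explicitly:
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m ./model.gguf -sm row -d 4096,16384,65536
```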
- `llama-bench` runs combinations of test parameters, so the user’s `-d 4096,16384,65536` sweep was expected to iterate; the real mistake was assuming `--device CUDA0,CUDA1` would combine the GPUs into one shared benchmark run
- Maintainer guidance in llama.cpp says `-sm layer` is mostly pipeline parallelism for larger prompts, while `-sm row` is the tensor-parallel mode if you want work spread across GPUs
- On short or moderate contexts, multi-GPU overhead can erase the gain, so “two cards” does not automatically beat one fast card
- For repeatable benchmarking, `CUDA_VISIBLE_DEVICES=0,1` plus explicit split settings is usually less ambiguous than trying to encode topology directly in `--device` (a sketch follows this list)
- This is the kind of llama.cpp tuning problem that matters for real inference deployments, not just synthetic benchmarks
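For the “repeatable benchmark” angle, a hedged sketch under the same placeholder model path: it assumes llama-bench’s comma-separated sweep applies to `-sm` the same way the thread shows it does for `-d`, so one run produces layer- and row-split numbers side by side; `-r` sets the repetition count per test.

```bash
# Both GPUs exposed via CUDA_VISIBLE_DEVICES instead of --device;
# -sm layer,row sweeps both split modes, -d sweeps the prompt depths,
# and -r 3 averages three repetitions per configuration.
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m ./model.gguf \
  -sm layer,row -d 4096,16384,65536 -r 3
```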