llama.cpp b9095 adds NCCL-free tensor parallelism
llama.cpp b9095 adds an internal CUDA AllReduce path for `LLAMA_SPLIT_MODE_TENSOR`, letting dual-GPU setups run tensor parallelism without NCCL. The release notes call out a current target of 2 GPUs, FP32, and tensors up to 256 KB.
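The post doesn't show the kernel, but the basic shape of an NCCL-free AllReduce between two GPUs is easy to sketch: each GPU copies its partial result to its peer and sums it into its local buffer. The snippet below is an illustrative CUDA sketch of that pattern, not the actual llama.cpp implementation; the buffer names and helper function are invented for the example, and allocation and error handling are omitted.

```cpp
#include <cuda_runtime.h>

// Elementwise in-place add: dst[i] += src[i].
__global__ void add_inplace(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

// Illustrative two-GPU AllReduce (sum) with no NCCL dependency.
// d_buf[g] holds GPU g's partial tensor; d_tmp[g] is same-size scratch on GPU g.
// After the call, both copies hold the elementwise sum.
void allreduce_sum_2gpu(float *d_buf[2], float *d_tmp[2], int n) {
    // Enable P2P both ways (returns an error, ignored here, if already enabled).
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    // Exchange partials; cudaMemcpyPeer stages through host memory
    // when direct peer access is unavailable.
    cudaMemcpyPeer(d_tmp[1], 1, d_buf[0], 0, (size_t)n * sizeof(float));
    cudaMemcpyPeer(d_tmp[0], 0, d_buf[1], 1, (size_t)n * sizeof(float));

    // Each GPU folds the remote partial into its local copy.
    const int threads = 256, blocks = (n + threads - 1) / threads;
    cudaSetDevice(0); add_inplace<<<blocks, threads>>>(d_buf[0], d_tmp[0], n);
    cudaSetDevice(1); add_inplace<<<blocks, threads>>>(d_buf[1], d_tmp[1], n);
    cudaSetDevice(0); cudaDeviceSynchronize();
    cudaSetDevice(1); cudaDeviceSynchronize();
}
```

For buffers under the stated 256 KB cap, a single exchange-and-add round trip like this is cheap enough that a full collectives library buys little, which presumably motivates the narrow initial scope.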
This is a meaningful infrastructure step for local inference: it lowers the dependency burden for multi-GPU tensor parallelism and makes dual-GPU consumer Blackwell rigs easier to bring up.
- The new internal AllReduce is explicitly NCCL-free, which matters most on desktop-class NVIDIA setups where NCCL can be a setup friction point
- The implementation is still narrow in scope, so this is a practical win for specific dual-GPU workflows (see the API sketch after this list) rather than a universal multi-GPU answer
- The release notes say the kernel works on Volta-or-newer NVIDIA GPUs, so the impact is broader than the Reddit title implies
- `GGML_CUDA_ALLREDUCE` and `--allreduce` make it easy to compare the internal and NCCL paths and to debug regressions
- For local model builders, this kind of plumbing change can improve throughput and reliability without changing the model stack
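If the new split mode is surfaced through llama.cpp's existing model-params C API (an assumption here; only the `LLAMA_SPLIT_MODE_TENSOR` enum name appears in the release summary), opting a dual-GPU run into it would look roughly like the sketch below; the model path is a placeholder.

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Assumed usage: LLAMA_SPLIT_MODE_TENSOR is taken from the release
    // summary; the surrounding calls are llama.cpp's existing C API.
    llama_model_params mp = llama_model_default_params();
    mp.n_gpu_layers = 999;                      // offload all layers
    mp.split_mode   = LLAMA_SPLIT_MODE_TENSOR;  // tensor parallelism across both GPUs

    llama_model *model = llama_model_load_from_file("model.gguf", mp);
    if (!model) return 1;

    // ... create a context and generate as usual; the AllReduce runs inside
    // the CUDA backend, so no NCCL is needed at build or run time ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```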
DISCOVERED 2026-05-10 · PUBLISHED 2026-05-10 · AUTHOR Bulky-Priority6824