Llama.cpp doubles prompt speed via PCIe optimization
In multi-GPU `llama.cpp` setups where the PCIe slots have unequal lane counts (x16 and x4), reordering the devices can double prompt processing speed. Listing the GPU in the x16 slot first in `CUDA_VISIBLE_DEVICES` makes it the primary device, so it handles the main prefill workload.
A single environment variable change fixes a common bottleneck in consumer-grade multi-GPU builds, where secondary slots are often physically x16 but electrically wired for fewer lanes. Prefill performance is highly sensitive to PCIe bandwidth, so lane distribution matters for large models. Reordering the GPUs ensures that the primary device for initial processing sits in the x16 slot rather than an x4 slot routed through the chipset. This is particularly beneficial for Mixture-of-Experts models, which require frequent data shuffling. Users can check each GPU's link width with `nvtop` or `lspci` to find out whether this "free" performance is available on standard desktop platforms such as AMD X570 boards.
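The reordering step can be sketched in Python as follows. This is a minimal illustration, not the original poster's method. It assumes per-GPU link widths are read with `nvidia-smi --query-gpu=index,name,pcie.link.width.current --format=csv,noheader` instead of `nvtop` or `lspci`. The `SAMPLE` text and the helper names `order_by_link_width` and `visible_devices_env` are hypothetical. `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set alongside `CUDA_VISIBLE_DEVICES` because CUDA's default enumeration order does not necessarily match the indices `nvidia-smi` reports.

```python
import csv
import io

# Hypothetical sample of what the nvidia-smi query above might print on a
# two-GPU desktop. The values are invented for illustration only.
SAMPLE = """0, NVIDIA GeForce RTX 3090, 4
1, NVIDIA GeForce RTX 3090, 16
"""


def order_by_link_width(csv_text):
    """Return GPU indices sorted so the widest PCIe link comes first."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index = int(row[0].strip())
        width = int(row[-1].strip())  # last column: current link width
        gpus.append((width, index))
    # Descending width; ties fall back to ascending index.
    gpus.sort(key=lambda gpu: (-gpu[0], gpu[1]))
    return [index for _width, index in gpus]


def visible_devices_env(csv_text):
    """Build the environment variables that apply the new device order."""
    order = order_by_link_width(csv_text)
    return {
        # Make CUDA number devices by PCI bus ID, matching nvidia-smi.
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
        # The first entry becomes CUDA device 0 inside the process.
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in order),
    }


if __name__ == "__main__":
    for key, value in visible_devices_env(SAMPLE).items():
        print(f"export {key}={value}")
```

The printed `export` lines can be pasted into the shell before starting `llama.cpp`. Link speed can drop while a GPU is idle, so it is worth reading the values under load before trusting the ordering.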
DISCOVERED
2026-03-17
PUBLISHED
2026-03-17
AUTHOR
overand