Llama.cpp doubles prompt speed via PCIe optimization

// 117d ago · TUTORIAL

Tuning `llama.cpp` for multi-GPU setups with asymmetric PCIe lane allocation (x16/x4) can double prompt processing speed. Reordering devices with `CUDA_VISIBLE_DEVICES` lets users list the GPU in the faster slot first so that it handles the primary prefill workload.
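As a minimal sketch of the reordering step, the snippet below builds an environment in which the GPU in the x16 slot is listed first and therefore becomes CUDA device 0. The helper name `reordered_env` and the index order `[1, 0]` are illustrative assumptions, not taken from the original post. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is an extra step assumed here so that the indices match what `nvidia-smi` reports.

```python
import os


def reordered_env(device_order, base_env=None):
    """Return a copy of the environment with the given CUDA device order.

    device_order: physical GPU indices, fastest PCIe slot first (assumed
    to be known already, e.g. from nvidia-smi or lspci).
    """
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate by PCI bus ID so indices line up with nvidia-smi output.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # The first entry in this list becomes CUDA device 0 for the process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in device_order)
    return env


# Hypothetical layout: physical GPU 1 sits in the x16 slot, GPU 0 in the x4 slot.
env = reordered_env([1, 0], base_env={})
print(env["CUDA_VISIBLE_DEVICES"])  # prints 1,0

# The environment could then be passed to a llama.cpp binary, for example:
#   subprocess.run(["llama-server", "-m", "model.gguf"], env=reordered_env([1, 0]))
# where "model.gguf" is a placeholder path.
```

Passing the environment per process, rather than exporting it globally, keeps the reordering scoped to the inference run.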

// ANALYSIS

A single environment variable change fixes a common bottleneck on consumer-grade multi-GPU systems, where secondary slots are often electrically limited to fewer lanes than their physical size suggests. Prefill is sensitive to PCIe bandwidth, so lane distribution matters for large models split across GPUs. Reordering GPUs puts the primary device for initial processing in the x16 slot rather than an x4 chipset-linked slot, which particularly benefits Mixture-of-Experts models that shuffle data between devices frequently. Users can check each card's negotiated link width with `nvtop` or `lspci` and claim this "free" performance on standard desktop platforms such as X570.
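To make the x16 versus x4 gap concrete, the sketch below ranks GPUs by theoretical link bandwidth. It reads `nvidia-smi` query output rather than the `nvtop` or `lspci` output mentioned above, because the CSV form is easier to parse. The sample text and the function name `rank_by_link` are hypothetical, and the per-lane figures are approximate one-direction values after encoding overhead. Note that the reported link generation can drop while a GPU is idle, so readings are best taken under load.

```python
import csv
import io

# Approximate usable GB/s per lane, one direction (PCIe 3.0, 4.0, 5.0).
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def rank_by_link(csv_text):
    """Return (gpu_index, approx_GBps) pairs, highest bandwidth first."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, gen, width = (int(field) for field in row)
        rows.append((index, round(GBPS_PER_LANE[gen] * width, 1)))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Hypothetical output of:
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current \
#              --format=csv,noheader
sample = "0, 4, 4\n1, 4, 16\n"

ranked = rank_by_link(sample)
print(ranked)                               # [(1, 31.5), (0, 7.9)]
print(",".join(str(i) for i, _ in ranked))  # 1,0
```

In this assumed layout the x16 card has roughly four times the link bandwidth of the x4 card, and the final line gives a candidate value for `CUDA_VISIBLE_DEVICES`. The indices from `nvidia-smi` follow PCI bus order, so they only match CUDA's numbering when `CUDA_DEVICE_ORDER=PCI_BUS_ID` is also set.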

// TAGS

llama-cpp, llm, gpu, inference, open-source

DISCOVERED

117d ago

2026-03-17

PUBLISHED

117d ago

2026-03-17

RELEVANCE

8/10

AUTHOR

overand
