OPEN_SOURCE
REDDIT // INFRASTRUCTURE
llama.cpp trims CUDA MoE overhead
This merged PR cuts MMQ stream-k overhead in llama.cpp’s CUDA path by moving some integer-division work to `fastdiv` and, when it is cheap enough, preferring plain tiling so the stream-k fixup pass can be skipped. The benchmark data shows mostly neutral results on dense workloads but clear gains on MoE prompt processing, including up to roughly 17% on a 2x RTX 4090 run.
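The `fastdiv` idea itself is a standard trick: replace a hardware divide by a runtime-constant divisor with a precomputed reciprocal multiply. A minimal host-side C++ sketch of one well-known variant (a Lemire-style ceiling reciprocal; the struct and function names here are illustrative, not llama.cpp's actual helpers):

```cpp
#include <cstdint>

// Precomputed state for dividing 32-bit values by a fixed divisor d >= 1.
// c = ceil(2^64 / d); with 64 fraction bits the quotient is exact for all
// 32-bit numerators (Lemire/Kaser/Kurz-style reciprocal division).
struct fastdiv_t {
    uint64_t c; // 0 encodes d == 1 (identity), since ceil(2^64/1) wraps to 0
};

static fastdiv_t fastdiv_init(uint32_t d) {
    // For d >= 2: ceil(2^64/d) == floor((2^64 - 1)/d) + 1, which fits in 64 bits.
    return fastdiv_t{ d == 1 ? 0 : UINT64_MAX / d + 1 };
}

static uint32_t fastdiv(uint32_t n, fastdiv_t fd) {
    if (fd.c == 0) return n; // d == 1
    // A multiply-high replaces the divide: floor(n * c / 2^64) == n / d.
    return (uint32_t) (((unsigned __int128) n * fd.c) >> 64);
}
```

On a GPU the high multiply would map to intrinsics like `__umul64hi`; the exact encoding llama.cpp uses may differ, but the payoff is the same: integer division in a hot kernel loop becomes a multiply and a shift.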
// ANALYSIS
This looks like the right kind of low-level optimization: small algorithmic and arithmetic changes that barely move dense cases, but unlock measurable gains where CUDA bookkeeping was the bottleneck.
- The main win comes from reducing integer-division overhead in MMQ and deciding between stream-k and tiling based on estimated efficiency loss, not a hard-coded MoE rule.
- Benchmarks show the uplift is concentrated in MoE prompt processing, especially on multi-GPU setups; dense-model runs are mostly flat, which suggests the patch is targeted rather than risky.
- The implementation is cautious: `kbc` moved to 32-bit with host-side overflow assertions, while the remaining critical arithmetic stays 64-bit where needed.
- For llama.cpp users running MoE models on NVIDIA hardware, this is the kind of change that matters more than a flashy headline feature because it improves the hot path directly.
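To make the "estimated efficiency loss" decision concrete, here is a toy sketch of that style of heuristic (the function, the 5% threshold, and the model of cost are all my own illustration, not the PR's actual logic): plain tiling wastes only the idle fraction of the last, partial wave of tiles across SMs, so it beats stream-k plus fixup whenever that fraction is small.

```cpp
// Decide between plain tiling and stream-k for a matrix-multiply launch.
// ntiles: number of output tiles; nsm: number of streaming multiprocessors.
// Hypothetical heuristic: accept tiling if the idle fraction caused by the
// partial last wave stays under max_loss; otherwise fall back to stream-k.
static bool prefer_tiling(int ntiles, int nsm, double max_loss = 0.05) {
    const int waves_full = ntiles / nsm;
    const int remainder  = ntiles % nsm;
    if (remainder == 0) {
        return true; // tiles divide evenly across SMs: no imbalance at all
    }
    const double waves = waves_full + 1.0;         // partial wave rounds up
    const double ideal = (double) ntiles / nsm;    // perfectly balanced waves
    const double loss  = 1.0 - ideal / waves;      // fraction of SM-time idle
    return loss <= max_loss; // small loss: skip stream-k and its fixup pass
}
```

The shape matches the PR's description: with many tiles per SM (typical for MoE prompt processing) the partial wave is a rounding error and tiling wins, while small or awkwardly sized launches still get stream-k's load balancing.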
// TAGS
llama.cpp · gpu · inference · open-source · llm
DISCOVERED
5h ago
2026-04-25
PUBLISHED
7h ago
2026-04-25
RELEVANCE
8/10
AUTHOR
jacek2023