llama.cpp trims CUDA MoE overhead
OPEN_SOURCE · REDDIT · 5h ago · INFRASTRUCTURE


This merged PR cuts MMQ stream-k overhead in llama.cpp’s CUDA path in two ways: it replaces integer divisions with `fastdiv` (a precomputed multiply-and-shift), and it prefers plain tiling over stream-k when the estimated efficiency loss of tiling is small enough, which skips the stream-k fixup pass entirely. The benchmark data shows mostly neutral results on dense workloads but clear gains on MoE prompt processing, including up to about 17% on a 2x RTX 4090 run.

// ANALYSIS

This looks like the right kind of low-level optimization: small algorithmic and arithmetic changes that barely move dense cases, but unlock measurable gains where CUDA bookkeeping was the bottleneck.

  • The main win comes from reducing integer-division overhead in MMQ and deciding between stream-k and tiling based on estimated efficiency loss, not a hard-coded MoE rule.
  • Benchmarks show the uplift is concentrated in MoE prompt processing, especially on multi-GPU setups; dense-model runs are mostly flat, which suggests the patch is targeted rather than risky.
  • The implementation is cautious: `kbc` moved to 32-bit with host-side overflow assertions, while the remaining critical arithmetic stays 64-bit where needed.
  • For llama.cpp users running MoE models on NVIDIA hardware, this is the kind of change that matters more than a flashy headline feature because it improves the hot path directly.
// TAGS
llama.cpp · gpu · inference · open-source · llm

DISCOVERED
5h ago (2026-04-25)

PUBLISHED
7h ago (2026-04-25)

RELEVANCE
8 / 10

AUTHOR
jacek2023