Triton MoE kernel tops MegaBlocks on Mixtral
A pure Triton implementation of a fused MoE dispatch kernel that speeds up inference by cutting kernel launches and memory traffic. It outperforms Stanford's MegaBlocks at common real-time inference batch sizes (32-128 tokens) while remaining portable across NVIDIA A100 and AMD MI300X GPUs.
The project demonstrates that Triton can beat hand-tuned CUDA in production inference scenarios, especially for complex MoE architectures. It fuses the Gate and Up GEMMs so they share input tiles in L2 cache, saving roughly 470 MB of memory traffic per forward pass on Mixtral-8x7B. The implementation reduces the MoE forward pass from 24+ kernel launches to just five, sharply cutting launch overhead for small-batch inference. A custom block-scheduling scheme handles expert load imbalance without wasting memory on padding, and the kernel runs on AMD MI300X with zero code changes.
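The Gate/Up fusion can be illustrated with a small NumPy sketch, assuming a SwiGLU-style expert MLP as in Mixtral. The names (`fused_gate_up`, `w_gate`, `w_up`) and the tiling loop are illustrative, not the project's actual API, and the shapes are scaled down from Mixtral-8x7B (hidden=4096, ffn=14336) so it runs instantly:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def unfused(x, w_gate, w_up):
    # Two separate GEMMs: each one streams the full input x from memory.
    return silu(x @ w_gate) * (x @ w_up)

def fused_gate_up(x, w_gate, w_up, tile=16):
    # Fused version: each tile of x is loaded once and reused for both
    # the gate and up projections -- the L2-sharing effect the kernel
    # exploits, roughly halving input-activation traffic.
    out = np.empty((x.shape[0], w_gate.shape[1]))
    for i in range(0, x.shape[0], tile):
        xt = x[i:i + tile]  # one load of this input tile
        out[i:i + tile] = silu(xt @ w_gate) * (xt @ w_up)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128))
w_gate = rng.standard_normal((128, 256))
w_up = rng.standard_normal((128, 256))
assert np.allclose(unfused(x, w_gate, w_up), fused_gate_up(x, w_gate, w_up))
```

Both paths produce identical results; the fused one simply touches each input tile once, which is where the memory-traffic saving comes from.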
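The write-up does not detail the block-scheduling scheme, so the sketch below shows the common padding-free approach such kernels use: sort tokens by assigned expert, then emit one work block per BLOCK_M-sized slice of each expert's contiguous token range. The block count tracks the actual load instead of padding every expert to the largest one. Function and variable names here are hypothetical:

```python
import numpy as np

BLOCK_M = 4  # tokens per work block (illustrative)

def schedule_blocks(expert_ids, num_experts):
    # Group token indices by expert with a stable sort.
    order = np.argsort(expert_ids, kind="stable")
    counts = np.bincount(expert_ids, minlength=num_experts)
    blocks = []  # (expert, start, end) ranges into `order`
    start = 0
    for e in range(num_experts):
        for s in range(start, start + counts[e], BLOCK_M):
            blocks.append((e, s, min(s + BLOCK_M, start + counts[e])))
        start += counts[e]
    return order, blocks

# Heavily imbalanced routing: expert 0 gets 9 tokens, expert 1 gets 2,
# expert 2 gets 1, expert 3 gets none.
expert_ids = np.array([0] * 9 + [2] + [1] * 2)
order, blocks = schedule_blocks(expert_ids, num_experts=4)
# 9 tokens -> 3 blocks, 2 -> 1, 1 -> 1, 0 -> 0: five blocks total,
# versus 12 if every expert were padded to the busiest one's 3 blocks.
assert len(blocks) == 5
```

A fixed per-expert grid would either pad small experts (wasted memory and compute) or overflow large ones; scheduling blocks from the sorted token layout avoids both.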
DISCOVERED
2026-04-05
PUBLISHED
2026-04-05
AUTHOR
bassrehab