OPEN SOURCE
REDDIT · 6d ago · OPEN-SOURCE RELEASE
Triton MoE kernel beats CUDA at inference
A pure Triton implementation of Mixture-of-Experts dispatch outperforms Stanford's CUDA-optimized Megablocks at inference-scale batch sizes. By fusing the gate and up projections, the kernel reduces memory traffic by 35% and maintains high performance across both NVIDIA and AMD hardware.
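The gate/up fusion can be illustrated at the algorithmic level in plain NumPy: instead of two separate GEMMs over the input (two reads of `x`, two intermediate buffers), the weights are concatenated and a single GEMM produces both projections at once. The function names, shapes, and SwiGLU-style activation below are illustrative assumptions, not the project's actual kernel code.

```python
import numpy as np

def silu(x):
    # SiLU activation, as used in SwiGLU-style MoE FFNs (assumed here).
    return x / (1.0 + np.exp(-x))

def moe_ffn_unfused(x, w_gate, w_up, w_down):
    # Baseline: two separate projections -> x is read twice and two
    # intermediate buffers are materialized.
    g = x @ w_gate
    u = x @ w_up
    return (silu(g) * u) @ w_down

def moe_ffn_fused(x, w_gate, w_up, w_down):
    # Fused form: one GEMM over concatenated weights -> one read of x
    # and a single intermediate buffer, which is then split.
    w_gu = np.concatenate([w_gate, w_up], axis=1)
    gu = x @ w_gu
    g, u = np.split(gu, 2, axis=1)
    return (silu(g) * u) @ w_down
```

In a real Triton kernel the same idea applies at the tile level: one load of the input tile feeds both projections, which is where the reported memory-traffic savings come from.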
// ANALYSIS
This project demonstrates that Triton is no longer just a "fast enough" alternative to CUDA; it can exceed hand-tuned CUDA C++ when algorithmic fusion is prioritized.
- Fusing the gate and up projections into a single tile load eliminates nearly 500 MB of intermediate buffers, a major win for memory-bound MoE inference.
- The block-scheduled grouped GEMM handles variable expert batch sizes without the padding overhead that typically plagues naive MoE implementations.
- Achieving 131% of Megablocks' speed at small batch sizes (32 tokens) makes this especially relevant for real-time chat and agentic workflows.
- Zero-change portability to AMD MI300X highlights the growing maturity of the Triton ecosystem for multi-vendor AI infrastructure.
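The padding-free grouped GEMM mentioned above can be sketched as follows: tokens are gathered per expert and each expert runs one GEMM over exactly the tokens it received, so no group is padded to a fixed capacity. All names and shapes here are illustrative assumptions about the general technique, not the project's implementation.

```python
import numpy as np

def grouped_gemm_dispatch(tokens, expert_ids, expert_weights):
    """Grouped GEMM over variable-size expert batches (illustrative).

    tokens:         (n_tokens, d_model)
    expert_ids:     (n_tokens,) expert index per token (top-1 routing assumed)
    expert_weights: (n_experts, d_model, d_out)
    """
    out = np.empty((tokens.shape[0], expert_weights.shape[2]))
    for e in range(expert_weights.shape[0]):
        idx = np.nonzero(expert_ids == e)[0]  # this expert's group; size varies
        if idx.size:
            # One GEMM per expert over exactly its tokens -- no padding rows.
            out[idx] = tokens[idx] @ expert_weights[e]
    return out
```

A capacity-padded implementation would instead allocate `n_experts * capacity` rows and mask out the unused ones; the block-scheduled approach avoids that wasted compute and memory entirely.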
// TAGS
triton-moe-dispatch · triton · llm · inference · gpu · mixtral · deepseek · open-source
DISCOVERED
2026-04-05
PUBLISHED
2026-04-05
RELEVANCE
8/10
AUTHOR
bassrehab