OPEN SOURCE
REDDIT · 6d ago · OPEN-SOURCE RELEASE
Triton MoE kernel beats CUDA at inference
A pure Triton implementation of Mixture-of-Experts dispatch outperforms Stanford's CUDA-optimized Megablocks at inference-scale batch sizes. By fusing the gate and up projections, the kernel reduces memory traffic by 35% and maintains high performance across both NVIDIA and AMD hardware.
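The gate/up fusion can be illustrated at the algorithmic level in plain NumPy: instead of two separate GEMMs over the input (two reads of `x`, two intermediate buffers), the weights are concatenated and a single GEMM produces both projections at once. The function names, shapes, and SwiGLU-style activation below are illustrative assumptions, not the project's actual kernel code.

```python
import numpy as np

def silu(x):
    # SiLU activation, as used in SwiGLU-style MoE FFNs (assumed here).
    return x / (1.0 + np.exp(-x))

def moe_ffn_unfused(x, w_gate, w_up, w_down):
    # Baseline: two separate projections -> x is read twice and two
    # intermediate buffers are materialized.
    g = x @ w_gate
    u = x @ w_up
    return (silu(g) * u) @ w_down

def moe_ffn_fused(x, w_gate, w_up, w_down):
    # Fused form: one GEMM over concatenated weights -> one read of x
    # and a single intermediate buffer, which is then split.
    w_gu = np.concatenate([w_gate, w_up], axis=1)
    gu = x @ w_gu
    g, u = np.split(gu, 2, axis=1)
    return (silu(g) * u) @ w_down
```

In a real Triton kernel the same idea applies at the tile level: one load of the input tile feeds both projections, which is where the reported memory-traffic savings come from.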
// ANALYSIS
This project demonstrates that Triton is no longer just a "fast enough" alternative to CUDA; it can exceed hand-tuned CUDA C++ when algorithmic fusion is prioritized.
- Fusing the gate and up projections into a single tile load eliminates nearly 500 MB of intermediate buffers, a major win for memory-bound MoE inference.
- The block-scheduled grouped GEMM handles variable expert batch sizes without the padding overhead that typically plagues naive MoE implementations.
- Achieving 131% of Megablocks' speed at small batch sizes (32 tokens) makes this especially relevant for real-time chat and agentic workflows.
- Zero-change portability to AMD MI300X highlights the growing maturity of the Triton ecosystem for multi-vendor AI infrastructure.
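The padding-free grouped GEMM mentioned above can be sketched as follows: tokens are gathered per expert and each expert runs one GEMM over exactly the tokens it received, so no group is padded to a fixed capacity. All names and shapes here are illustrative assumptions about the general technique, not the project's implementation.

```python
import numpy as np

def grouped_gemm_dispatch(tokens, expert_ids, expert_weights):
    """Grouped GEMM over variable-size expert batches (illustrative).

    tokens:         (n_tokens, d_model)
    expert_ids:     (n_tokens,) expert index per token (top-1 routing assumed)
    expert_weights: (n_experts, d_model, d_out)
    """
    out = np.empty((tokens.shape[0], expert_weights.shape[2]))
    for e in range(expert_weights.shape[0]):
        idx = np.nonzero(expert_ids == e)[0]  # this expert's group; size varies
        if idx.size:
            # One GEMM per expert over exactly its tokens -- no padding rows.
            out[idx] = tokens[idx] @ expert_weights[e]
    return out
```

A capacity-padded implementation would instead allocate `n_experts * capacity` rows and mask out the unused ones; the block-scheduled approach avoids that wasted compute and memory entirely.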
// TAGS
triton-moe-dispatch · triton · llm · inference · gpu · mixtral · deepseek · open-source
DISCOVERED
2026-04-05
PUBLISHED
2026-04-05
RELEVANCE
8/10
AUTHOR
bassrehab