MoE models top dense models with 7x compute leverage
Ant Group researchers introduce Efficiency Leverage (EL), a metric for the compute advantage of an MoE model over a dense model of equal performance. Using it, they show that Ling-mini-beta (0.85B active parameters) matches a 6.1B dense model with about 7x less compute. The study's unified scaling laws indicate that MoE's efficiency advantage grows as training compute scales.
MoE isn't just about parameter count; it's a fundamental compute leverage play that gets stronger as models grow.
- Efficiency Leverage (EL) is quantified as a predictable power law driven by the expert activation ratio and the compute budget.
- Empirical testing on 1T tokens shows 0.85B active MoE parameters matching 6.1B dense parameters.
- Optimal expert granularity is identified in the 8-12 range, providing a blueprint for future model architectures.
- Data hunger remains the "tax" for MoE: it needs more training tokens than a dense counterpart to reach optimal compute efficiency.
- The unified scaling law suggests MoE efficiency gains have not yet hit a ceiling.
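The 7x figure can be sanity-checked with rough arithmetic. The sketch below is not the paper's method: it assumes the common rule of thumb that training compute is about 6 × active parameters × tokens, and that both models see the same 1T tokens. The function names are hypothetical, chosen only for illustration.

```python
def train_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs via the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * active_params * tokens


def efficiency_leverage(dense_params: float, moe_active_params: float, tokens: float) -> float:
    """Ratio of dense compute to MoE compute at matched performance and token count."""
    return train_flops(dense_params, tokens) / train_flops(moe_active_params, tokens)


# Figures from the summary: 6.1B dense vs 0.85B active MoE, trained on 1T tokens.
el = efficiency_leverage(dense_params=6.1e9, moe_active_params=0.85e9, tokens=1e12)
print(f"EL ~ {el:.1f}x")  # prints "EL ~ 7.2x"
```

With equal token counts the token term cancels, so the leverage reduces to the ratio of active parameters, 6.1 / 0.85, which is in line with the roughly 7x claim.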
DISCOVERED
45d ago
2026-04-28
PUBLISHED
45d ago
2026-04-28
AUTHOR
Different_Fix_2217