Transformer MLP linearization cuts compute, sometimes improves perplexity
A March 3, 2026 arXiv paper shows that many transformer MLP passes can be replaced by a precomputed linear matrix, selected by a tiny context-based gate. In GPT-2, the gate routes 25-56% of MLP passes to the cheap linear path at under 1% perplexity cost. The authors also report that progressively linearizing middle GPT-2 layers can improve perplexity versus baseline, suggesting some nonlinear capacity is misallocated.
This is one of the more practically interesting efficiency papers this year because it targets the MLP blocks, which account for a large share of decoder inference compute, without requiring a full architectural rewrite.
- The key claim is context-dependent routing, not token lookup, which makes the method harder to cache statically but more realistic in live inference.
- Gains are architecture-sensitive: GPT-2 responds well, while Pythia is tougher, so transfer to newer families is promising but unproven.
- The reported perplexity improvement from partial linearization hints at regularization benefits, not just speedups.
- If replicated on modern SwiGLU models, this could become a low-overhead optimization path for local and edge serving stacks.
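To make the routing idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the logistic gate, the 0.5 threshold, the choice to feed the gate the layer's input hidden state, and every name and weight value (`PARAMS`, `W_lin`, `gate_w`, `routed_mlp`) are assumptions made for illustration. It only shows the shape of the idea: a cheap gate decides per position whether to apply one precomputed matrix or the full nonlinear MLP.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gelu(v):
    """Exact GELU, the activation used in GPT-2 style MLPs."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def full_mlp(x, W_in, W_out):
    """Expensive path: up-projection, nonlinearity, down-projection."""
    return matvec(W_out, [gelu(h) for h in matvec(W_in, x)])


def gate_prob(x, gate_w, gate_b):
    """Tiny gate: one dot product plus a sigmoid (assumed form)."""
    z = sum(w * v for w, v in zip(gate_w, x)) + gate_b
    return 1.0 / (1.0 + math.exp(-z))


def routed_mlp(x, params, threshold=0.5):
    """Return (output, path), taking the cheap path when the gate allows it."""
    if gate_prob(x, params["gate_w"], params["gate_b"]) >= threshold:
        # Cheap path: a single precomputed d x d matrix replaces the MLP.
        return matvec(params["W_lin"], x), "linear"
    return full_mlp(x, params["W_in"], params["W_out"]), "full"


# Hypothetical toy weights: model width 2, MLP hidden width 4.
PARAMS = {
    "W_in": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]],
    "W_out": [[0.5, 0.0, 0.25, 0.0], [0.0, 0.5, 0.0, 0.25]],
    "W_lin": [[0.5, 0.0], [0.0, 0.5]],
    "gate_w": [1.0, 0.0],
    "gate_b": 0.0,
}

hidden_states = [[2.0, 1.0], [-2.0, 1.0]]
paths = [routed_mlp(x, PARAMS)[1] for x in hidden_states]
cheap_fraction = paths.count("linear") / len(paths)
```

In this toy setup the first hidden state takes the linear path and the second falls back to the full MLP, so `cheap_fraction` is 0.5. In a real model the gate and the linear matrix would have to be fit against the trained MLP's behavior, and that fitting procedure is not sketched here.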
DISCOVERED: 2026-03-05
PUBLISHED: 2026-03-05
AUTHOR: Interesting_Meat_900