OPEN_SOURCE
REDDIT // 37d ago // RESEARCH PAPER
Transformer MLP linearization cuts compute, sometimes improves perplexity
An arXiv paper dated March 3, 2026 shows that many transformer MLP passes can be replaced by a precomputed linear matrix selected by a tiny context-based gate, with 25-56% of tokens routed to the cheap path in GPT-2 at under 1% perplexity cost. The authors also report that progressively linearizing middle GPT-2 layers can improve perplexity versus baseline, suggesting some nonlinear capacity is misallocated.
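The core mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the gate here is a single scored vector with a threshold, and `W_lin` is a stand-in for whatever linear replacement the authors fit offline; all names and constants are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

# Full MLP weights for a standard two-layer GELU block.
W_in = rng.normal(0, 0.02, (d_model, d_ff))
W_out = rng.normal(0, 0.02, (d_ff, d_model))

# Hypothetical precomputed linear replacement: one d_model x d_model
# matrix standing in for the MLP's average linear behavior (the real
# method fits this offline; the 0.5 scale is arbitrary here).
W_lin = (W_in @ W_out) * 0.5

# Tiny context-based gate: a learned vector scored against the hidden
# state decides cheap (linear) vs full (nonlinear) path per token.
gate_w = rng.normal(0, 1.0, d_model)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(h, threshold=0.0):
    """Route each token through the cheap linear path when its gate
    score falls below threshold, else through the full MLP."""
    score = h @ gate_w                       # (seq,) context-dependent scores
    cheap = score < threshold                # boolean routing mask
    out = np.empty_like(h)
    out[cheap] = h[cheap] @ W_lin            # one matmul, no activation
    out[~cheap] = gelu(h[~cheap] @ W_in) @ W_out
    return out, cheap.mean()                 # output + fraction routed cheap

h = rng.normal(0, 1.0, (32, d_model))        # 32 token hidden states
y, cheap_frac = mlp_block(h)
```

Because the routing decision depends on the hidden state rather than the token identity, the cheap-path fraction varies with context, which matches the paper's framing of context-dependent (not lookup-based) gating.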
// ANALYSIS
This is one of the more practically interesting efficiency papers this year because it targets the most expensive part of decoder inference without requiring a full architectural rewrite.
- The key claim is context-dependent routing, not token lookup, which makes the method harder to cache statically but more realistic in live inference.
- Gains are architecture-sensitive: GPT-2 responds well, while Pythia is tougher, so transfer to newer families is promising but unproven.
- The reported perplexity improvement from partial linearization hints at regularization benefits, not just speedups.
- If replicated on modern SwiGLU models, this could become a low-overhead optimization path for local and edge serving stacks.
// TAGS
half-the-nonlinearity-is-wasted · llm · inference · research
DISCOVERED
37d ago
2026-03-05
PUBLISHED
37d ago
2026-03-05
RELEVANCE
9/10
AUTHOR
Interesting_Meat_900