Transformer MLP linearization cuts compute, sometimes improves perplexity
OPEN_SOURCE ↗
REDDIT · RESEARCH PAPER · 37d ago

An arXiv paper dated March 3, 2026 shows that many transformer MLP passes can be replaced by a precomputed linear matrix, selected by a tiny context-based gate, with 25-56% of tokens routed to the cheap path in GPT-2 at under 1% perplexity cost. The authors also report that progressively linearizing middle GPT-2 layers can improve perplexity over the baseline, suggesting some nonlinear capacity is misallocated.
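A minimal sketch of the idea as described above, not the authors' code: a tiny gate scores each token's hidden state, and low-scoring tokens take a precomputed linear matrix instead of the full nonlinear MLP. The gate weights, the least-squares fit used to build the cheap matrix, and all dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

# Full MLP weights (ReLU stands in for the model's actual nonlinearity).
W1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)

# Precomputed cheap path: one d_model x d_model matrix. A least-squares
# fit of the MLP over sample activations is an assumed stand-in for
# however the paper actually derives it.
X_fit = rng.normal(size=(1024, d_model))
Y_fit = np.maximum(X_fit @ W1, 0.0) @ W2
W_lin, *_ = np.linalg.lstsq(X_fit, Y_fit, rcond=None)

# Tiny context-based gate: a single linear score on the hidden state
# (hypothetical; the paper's gate may be richer).
w_gate = rng.normal(size=(d_model,))

def mlp_or_linear(x, threshold=0.0):
    """Route each token: cheap linear path when the gate score is low."""
    score = x @ w_gate                      # per-token gate score
    cheap = score < threshold               # boolean routing mask
    out = np.empty_like(x)
    out[cheap] = x[cheap] @ W_lin           # one matmul, no nonlinearity
    out[~cheap] = np.maximum(x[~cheap] @ W1, 0.0) @ W2  # full MLP
    return out, cheap.mean()

x = rng.normal(size=(64, d_model))
y, cheap_frac = mlp_or_linear(x)
```

The cheap path costs one `d_model x d_model` matmul per token versus two `d_model x d_ff` matmuls plus a nonlinearity, which is where the reported compute savings would come from when a large fraction of tokens route cheap.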

// ANALYSIS

This is one of the more practically interesting efficiency papers this year because it targets the most expensive part of decoder inference without requiring a full architectural rewrite.

  • The key claim is context-dependent routing, not token lookup, which makes the method harder to cache statically but more realistic in live inference.
  • Gains are architecture-sensitive: GPT-2 responds well, while Pythia is tougher, so transfer to newer families is promising but unproven.
  • The reported perplexity improvement from partial linearization hints at regularization benefits, not just speedups.
  • If replicated on modern SwiGLU models, this could become a low-overhead optimization path for local and edge serving stacks.
// TAGS
half-the-nonlinearity-is-wasted · llm · inference · research

DISCOVERED

2026-03-05 (37d ago)

PUBLISHED

2026-03-05 (37d ago)

RELEVANCE

9/10

AUTHOR

Interesting_Meat_900