REDDIT · RESEARCH PAPER

Kimi Attention Residuals rethink Transformer skip connections

Moonshot AI’s Kimi team proposes Attention Residuals, which replace fixed residual accumulation with learned depth-wise softmax attention so each layer can selectively draw on earlier representations. In repo-reported results, the Block AttnRes variant serves as a practical drop-in, matching baseline loss with roughly a 1.25x compute-efficiency gain and improving downstream scores when integrated into Kimi Linear (48B total parameters, 3B activated).
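To make the core idea concrete, here is a minimal sketch of a depth-wise attention residual: instead of adding a fixed skip connection, each layer computes a softmax over learned per-layer logits and takes a weighted sum of all earlier layer outputs. All names and shapes are illustrative assumptions, not Moonshot's implementation (which would learn the scores and likely condition them on content).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_residual(history, scores):
    """Replace the fixed skip with a learned weighted sum over depth.

    history: (L, T, D) stacked outputs of layers 0..L-1
    scores:  (L,) logits over those layers (learned in practice)
    """
    w = softmax(scores)                      # (L,) weights over earlier layers
    return np.einsum("l,ltd->td", w, history)

# Toy forward pass: each new layer pulls selectively from all prior states.
rng = np.random.default_rng(0)
T, D = 4, 8
history = [rng.standard_normal((T, D))]      # embedding output
for depth in range(3):
    sublayer_out = rng.standard_normal((T, D))   # stand-in for attention/MLP
    scores = rng.standard_normal(len(history))   # would be learned parameters
    h = attn_residual(np.stack(history), scores) + sublayer_out
    history.append(h)
```

With uniform logits this reduces to averaging prior states; a standard residual stream corresponds to putting all weight on the immediately preceding layer, so the baseline is recoverable as a special case.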

// ANALYSIS

This is a smart architectural bet: upgrade the least-changed part of Transformers without forcing a full stack rewrite.

  • The key value is selectivity across depth, which directly targets PreNorm-style signal dilution in very deep models.
  • Block AttnRes keeps the idea deployable by trading full cross-layer attention for compressed block-level routing.
  • Reported latency overhead (<2% at inference) is low enough to make real-world adoption plausible if independent replications hold.
  • The biggest open question is external validation across non-Kimi model families and larger-scale training regimes.
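The block-level trade-off described above can be sketched as follows. Rather than attending over every prior layer, layers are grouped into blocks and the residual attends only over one compressed summary per block. The summary choice here (the last output of each block) is an assumption for illustration; the paper's actual compression may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block_attn_residual(history, block_size, scores):
    """Attend over per-block summaries instead of every prior layer.

    history: (L, T, D) stacked layer outputs
    Assumption: a block's summary is its final layer output.
    """
    summaries = history[block_size - 1::block_size]   # (L // block_size, T, D)
    w = softmax(scores[: summaries.shape[0]])         # weights over blocks
    return np.einsum("b,btd->td", w, summaries)

rng = np.random.default_rng(1)
history = rng.standard_normal((8, 4, 16))   # 8 layers, seq len 4, dim 16
out = block_attn_residual(history, block_size=4, scores=rng.standard_normal(2))
```

This shrinks the routing table from L entries to L/block_size, which is the kind of compression that keeps the reported inference overhead small.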
// TAGS
attention-residuals · kimi-linear · llm · transformer · research · inference · open-source

DISCOVERED: 2026-03-17 (26d ago)

PUBLISHED: 2026-03-16 (26d ago)

RELEVANCE: 9/10

AUTHOR: nekofneko