OPEN_SOURCE
YT · YOUTUBE // 11d ago
RESEARCH PAPER
Kimi Team rethinks transformer residuals
Moonshot AI’s Kimi team proposes AttnRes, which replaces fixed residual summation with learned softmax attention over earlier layer outputs. The paper also adds Block AttnRes to cut memory and communication overhead, and reports gains when integrated into Kimi Linear.
// ANALYSIS
This is a real architectural swing, not a cosmetic tweak: it changes depth aggregation from blind accumulation to learned routing, which directly targets the hidden-state dilution problem in very deep transformers.
- The core idea is elegant: let each layer decide which earlier representations matter instead of adding everything with equal weight.
- Block AttnRes is the practical unlock; without it, full history attention across layers would be too expensive for large-scale training.
- The Kimi Linear integration matters more than the abstract mechanism, because it shows the idea can survive contact with an actual training stack.
- The big question is portability: if the gains depend on Moonshot’s model family or training recipe, adoption will stay niche; if not, this could become a new default design pattern for deep LLMs.
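The paper's exact formulation is not given in this summary, so the following is a minimal NumPy sketch of the general idea only: a query derived from the current layer's output attends over the stacked outputs of earlier layers, and the resulting softmax weights replace the uniform summation of a standard residual stream. The names (`attn_residual`, `Wq`, `Wk`) and the single-query-per-position design are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(current, history, Wq, Wk):
    """Mix earlier layer outputs via learned softmax attention
    instead of a fixed residual sum (illustrative sketch only).

    current: (seq, d) output of the current layer.
    history: list of (seq, d) outputs from earlier layers.
    Wq, Wk:  (d, d) learned projections (hypothetical parameters).
    """
    H = np.stack(history, axis=1)            # (seq, L, d) layer history
    q = current @ Wq                         # (seq, d) query per position
    k = H @ Wk                               # (seq, L, d) key per layer
    scores = np.einsum('sd,sld->sl', q, k) / np.sqrt(q.shape[-1])
    w = softmax(scores, axis=-1)             # (seq, L): one weight per earlier layer
    mixed = np.einsum('sl,sld->sd', w, H)    # weighted sum over the layer axis
    return current + mixed                   # learned routing replaces blind accumulation
```

With zero projections the weights collapse to uniform, recovering an averaged residual; training the projections lets each position down-weight diluted or irrelevant earlier states, which is the dilution fix the analysis above describes.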
// TAGS
attention-residuals · kimi · llm · reasoning · research
DISCOVERED
11d ago
2026-04-01
PUBLISHED
11d ago
2026-04-01
RELEVANCE
9 / 10
AUTHOR
AI Search