OPEN_SOURCE
YT · YOUTUBE // 31d ago // RESEARCH PAPER
Grow, Don’t Overwrite curbs catastrophic forgetting
This paper proposes a function-preserving way to expand transformer MLP layers during fine-tuning by duplicating up-projection weights and compensating in the down-projection layer. On Gemma models, it reports downstream performance comparable to standard fine-tuning while preserving much more of the base model’s original capabilities.
// ANALYSIS
This is a sharp continual-learning result because it mitigates forgetting by adding reusable capacity instead of treating retention as a regularization tax.
- The core method is elegant: copy the MLP up-projection, scale the down-projection, and keep the expanded network mathematically identical to the original at initialization so training stays stable.
- The paper shows the clearest gains on high-shift tasks like translation and entailment, where standard fine-tuning erases prior capabilities but the growth-based variants preserve them.
- It is more practical than it first sounds: growing all layers trains roughly 60% of the original parameter count, and growing only 9-10 targeted layers gets close to full performance at roughly 30%.
- The main limitation is scope: the experiments are centered on MLP growth in transformer models, and harder reasoning tasks like MathQA still benefit from a less frozen variant.
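The function-preserving trick in the bullets above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact recipe: it assumes a plain two-matrix MLP (`y = W_down · act(W_up · x)`) with an elementwise activation, whereas Gemma-style gated MLPs would need the gate projection duplicated alongside the up-projection. Duplicating hidden units and halving the matching down-projection columns leaves the output unchanged at initialization:

```python
import numpy as np

def grow_mlp(W_up, W_down, n_new):
    """Function-preserving MLP expansion (illustrative sketch).

    W_up:   (hidden, d_in)  up-projection
    W_down: (d_out, hidden) down-projection
    n_new:  number of hidden units to duplicate (here: the first n_new)
    """
    idx = np.arange(n_new)
    # Duplicate rows of the up-projection: the copies compute the
    # same pre-activations (and, elementwise, the same activations).
    W_up_new = np.vstack([W_up, W_up[idx]])
    # Compensate in the down-projection: halve the original columns
    # and append the same halved columns for the copies, so each
    # duplicated unit contributes 0.5 + 0.5 = 1x its old output.
    W_down_new = W_down.copy()
    W_down_new[:, idx] *= 0.5
    W_down_new = np.hstack([W_down_new, W_down_new[:, idx]])
    return W_up_new, W_down_new
```

Because the expanded network is exactly the original function at initialization, fine-tuning can train the new (and rescaled) weights without the loss spike that random expansion would cause.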
// TAGS
grow-dont-overwrite · fine-tuning · llm · research
DISCOVERED
2026-03-11
PUBLISHED
2026-03-11
RELEVANCE
9 / 10
AUTHOR
Discover AI