Affine Divergence reframes activation updates
OPEN_SOURCE · REDDIT // 24d ago · RESEARCH PAPER

This paper argues that while gradient descent moves parameters in the steepest-descent direction, the activation updates those parameter steps induce can be systematically misaligned with the steepest-descent direction in activation space. It turns that mismatch into two fixes: a new affine-like layer with built-in normalisation, and PatchNorm, a new convolution-friendly normaliser.
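As an illustrative sketch of the claimed mismatch (not the paper's own derivation): for a toy linear layer y = Wx, a single steepest-descent step on W induces, for each sample in the batch, an activation change that mixes in gradient terms from the other samples, so each Δyᵢ need not point along that sample's own steepest-descent direction −gᵢ. All names and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer y = W x, a batch of inputs, arbitrary upstream gradients g_i = dL/dy_i.
d_in, d_out, batch = 5, 3, 8
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(batch, d_in))   # rows are inputs x_i
G = rng.normal(size=(batch, d_out))  # rows are upstream gradients g_i

eta = 1e-2
# Steepest-descent step in parameter space: dW = -eta * sum_i g_i x_i^T
dW = -eta * G.T @ X

# Activation change this step induces on each sample: dy_i = dW x_i
dY = X @ dW.T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Per-sample alignment between dy_i and that sample's steepest-descent direction -g_i.
# With batch > 1, dy_i picks up (x_j . x_i) g_j cross terms, so these are generally below 1.
cosines = [cosine(dY[i], -G[i]) for i in range(batch)]
print(cosines)

# Contrast: with a single sample, dy = -eta * ||x||^2 * g is exactly aligned.
dW1 = -eta * np.outer(G[0], X[0])
dy1 = dW1 @ X[0]
print(cosine(dy1, -G[0]))  # ~1.0
```

The single-sample case is exactly aligned because the parameter update is rank-one in that sample's own input; batching is what introduces the cross-sample mixing.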

// ANALYSIS

Hot take: this reads less like “yet another normaliser” and more like a geometry critique of how we train deep nets. If the theory holds up broadly, the interesting part is not scale invariance itself, but the idea that we have been stepping in the wrong activation-space direction all along.

  • The paper’s strongest claim is mechanistic: activation updates can be misaligned even when parameter updates are mathematically correct.
  • The new affine-like layer is notable because it keeps degrees of freedom instead of behaving like a standard normaliser, which makes it a cleaner architectural alternative for MLPs.
  • PatchNorm is the more exploratory idea, but it matters because it extends the same correction logic to convolution, where existing normalisation tricks are often bolted on rather than derived.
  • The reported batch-size effect is a useful falsifiable prediction: if larger batches hurt divergence-correcting layers, that is a sharp signal the mechanism is real, not just an optimisation artifact.
  • If these results generalise, they could shift how we think about why LayerNorm, RMSNorm, and BatchNorm help, from “stabilise scale” to “repair update geometry.”
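The batch-size prediction above can be motivated in the same toy linear-layer setting: as the batch grows, each sample's induced activation change accumulates more cross-sample gradient terms, so its alignment with the sample's own steepest-descent direction decays. This is a hedged sketch with hypothetical dimensions and step size, not a reproduction of the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 16, 16
eta = 1e-2

def mean_alignment(batch, trials=200):
    """Average cosine between the induced dy_i and -g_i for random linear layers."""
    cos = []
    for _ in range(trials):
        X = rng.normal(size=(batch, d_in))   # inputs x_i
        G = rng.normal(size=(batch, d_out))  # upstream gradients g_i
        dW = -eta * G.T @ X                  # steepest-descent parameter step
        dY = X @ dW.T                        # induced activation changes
        for i in range(batch):
            cos.append(dY[i] @ (-G[i]) / (np.linalg.norm(dY[i]) * np.linalg.norm(G[i])))
    return float(np.mean(cos))

# Alignment is exact at batch 1 and decays as cross-sample terms pile up.
for b in (1, 2, 8, 64):
    print(b, round(mean_alignment(b), 3))
```

Under this toy model the decay scales roughly like 1/√(1 + (batch−1)/d_in), which is one way to see why a batch-size dependence would be a signature of the geometric mechanism rather than of generic optimisation noise.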
// TAGS
research · affine-divergence · patchnorm

DISCOVERED

24d ago

2026-03-18

PUBLISHED

25d ago

2026-03-18

RELEVANCE

8/10

AUTHOR

GeorgeBird1