OPEN_SOURCE
REDDIT // RESEARCH PAPER
Clip to Grok hits 249x speedup
Researchers released an update to "Clip to Grok," a weight norm clipping technique that dramatically accelerates generalization in neural networks. By applying per-row L2 clipping to decoder weights after every optimizer step, the method eliminates "grokking delay" and achieves up to 249x speedup on modular arithmetic and non-abelian permutation tasks.
// ANALYSIS
Weight norm clipping is the "hard" regularization that weight decay always wanted to be.
- Replaces slow, "soft" weight decay with a hard per-row L2 norm constraint that forces models into the "generalization zone" immediately.
- Dramatically reduces "grokking delay" by preventing models from lingering in high-norm memorization regimes.
- Implementation is trivial (a few lines of PyTorch) and has already been integrated by community stalwarts like lucidrains in fast-weight-attention.
- Shows strong synergy with sign-based optimizers like Lion, suggesting a new primitive for fast-generalizing training loops.
- Findings indicate that the optimal max_norm correlates with algebraic complexity: non-abelian tasks require a tighter constraint (max_norm = 1.0) than modular addition (max_norm = 2.0).
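The core operation described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's reference code: the function name `clip_rows_` and its signature are assumptions, and which weight matrix to clip (the post says decoder weights) is up to the user.

```python
import torch


def clip_rows_(weight: torch.Tensor, max_norm: float) -> None:
    """Clamp each row's L2 norm to at most max_norm, in place.

    Sketch of per-row weight norm clipping as described in the post;
    intended to be called after every optimizer step.
    """
    with torch.no_grad():
        row_norms = weight.norm(dim=1, keepdim=True)        # shape (rows, 1)
        # Rows already within the ball get scale 1; larger rows are shrunk.
        scale = max_norm / row_norms.clamp(min=max_norm)
        weight.mul_(scale)
```

In a training loop this would sit right after `optimizer.step()`, e.g. `clip_rows_(model.decoder.weight, max_norm=1.0)` for a non-abelian task (using the tighter constraint the findings suggest). Because rows at or below `max_norm` receive a scale of exactly 1, the operation is a projection, not a decay: it never touches weights already inside the constraint set.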
// TAGS
clip-to-grok · llm · fine-tuning · research · open-source
DISCOVERED
2026-04-02
PUBLISHED
2026-04-01
RELEVANCE
8 / 10
AUTHOR
niftylius