OPEN_SOURCE
REDDIT // 8d ago · BENCHMARK RESULT
Shifted Key GDN trims params, keeps pace
A Reddit experiment reports that Gated Delta Net can drop learned Q/K projections and instead use the current hidden state as query and the previous hidden state as key. On undertrained coding runs, the shifted-key variant slightly improves loss while using about 12.5% to 25% fewer layer parameters.
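A minimal sketch of what such a shifted-key update could look like, assuming the standard gated delta rule with a per-step decay gate and write strength. The function name, the retained value projection, and the normalization choices are illustrative assumptions, not details from the post:

```python
import numpy as np

def shifted_key_gdn(h, Wv, alpha, beta):
    """Sketch of a shifted-key gated delta recurrence.

    h:     (T, d) hidden states from the previous layer
    Wv:    (d, d) learned value projection (kept; only Q/K projections are dropped)
    alpha: (T,)   per-step decay gates in [0, 1]
    beta:  (T,)   per-step write strengths in [0, 1]
    """
    T, d = h.shape
    S = np.zeros((d, d))          # associative memory state
    out = np.zeros_like(h)
    k_prev = np.zeros(d)          # previous hidden state serves as the key
    for t in range(T):
        q = h[t] / (np.linalg.norm(h[t]) + 1e-6)       # current hidden state as query
        k = k_prev / (np.linalg.norm(k_prev) + 1e-6)   # shifted key, no learned W_k
        v = Wv @ h[t]
        # gated delta rule: decay the state, erase the old value stored at k,
        # then write the new value v at address k
        S = alpha[t] * (S - beta[t] * np.outer(S @ k, k)) + beta[t] * np.outer(v, k)
        out[t] = S @ q
        k_prev = h[t]
    return out
```

Note the parameter saving: the only learned matrix in this sketch is `Wv`; the query and key come for free from the residual stream, which is where the reported 12.5% to 25% reduction would originate.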
// ANALYSIS
The interesting part is not just the parameter cut; it is that the inductive bias seems to fit linear-attention-style memory better than softmax attention. In other words, GDN may be getting more mileage from state geometry and token-to-token chaining than from learned query/key transforms.
- The reported gain is small but meaningful: equal or better loss with fewer parameters and faster convergence suggests the Q/K projections were partly redundant at this scale.
- The effect not transferring to softmax attention is the key clue; exact key lookup and normalized retrieval there likely depend more on an explicit query/key separation.
- The shifted-key setup bakes in a strong local-continuity prior, which may help linear memory blocks because the hidden state already carries enough context to serve as both selector and address.
- The results are still narrow: one dataset, undertrained runs, and a specific architecture family, so this is a useful architectural signal rather than proof of a general rule.
- If it holds up at larger scales, it argues for simpler linear-attention blocks that spend capacity on memory dynamics instead of projection layers.
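As a back-of-envelope check on the 12.5% to 25% figure, here is a hypothetical per-layer parameter count. The shapes and the 4-matrix mixer / 8-matrix MLP split are illustrative assumptions; real GDN layers also carry gate and convolution parameters not modeled here:

```python
d = 1024                       # hypothetical model width
proj = d * d                   # one square projection matrix
mixer = 4 * proj               # Wq, Wk, Wv, Wo in the token-mixing block
mlp = 8 * proj                 # a d -> 4d -> d feed-forward block
saved = 2 * proj               # dropping the learned Q and K projections

frac_mixer = saved / mixer         # 0.5 of the mixing block alone
frac_layer = saved / (mixer + mlp) # ~0.167 of the full layer
```

Under these assumptions the saving is 50% of the mixing block but roughly 16.7% of the whole layer, which sits inside the 12.5% to 25% range the post reports.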
// TAGS
shifted-key-gated-delta-net · gated-delta-net · llm · benchmark · research · open-source
DISCOVERED
8d ago
2026-04-04
PUBLISHED
8d ago
2026-04-04
RELEVANCE
8 / 10
AUTHOR
jfguan