Shifted Key GDN trims params, keeps pace
OPEN_SOURCE
REDDIT // 8d ago // BENCHMARK RESULT


A Reddit experiment reports that Gated Delta Net can drop learned Q/K projections and instead use the current hidden state as query and the previous hidden state as key. On undertrained coding runs, the shifted-key variant slightly improves loss while using about 12.5% to 25% fewer layer parameters.
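The mechanism described above can be sketched in a few lines. This is a minimal illustration of a gated-delta-rule memory step with the shifted-key trick, not the experiment's actual code: all names (`S`, `alpha`, `beta`, `W_v`) and the single-head, unbatched layout are assumptions for clarity.

```python
import numpy as np

def shifted_key_gdn_step(S, h_prev, h_t, W_v, alpha, beta):
    """One step of a gated-delta-rule linear memory, shifted-key variant.

    Instead of learned W_q / W_k projections, the current hidden state
    serves directly as the query and the previous hidden state as the key
    (both L2-normalized); only the value projection W_v remains learned.
    Illustrative sketch only -- the post's exact GDN layer may differ.
    """
    q = h_t / (np.linalg.norm(h_t) + 1e-6)        # query = current state
    k = h_prev / (np.linalg.norm(h_prev) + 1e-6)  # key = previous state
    v = W_v @ h_t                                 # value is still learned
    # gated delta rule: decay the state, erase along k, write v at address k
    S = alpha * S @ (np.eye(S.shape[1]) - beta * np.outer(k, k)) \
        + beta * np.outer(v, k)
    o = S @ q                                     # read with the query
    return S, o
```

Because `k` is the previous hidden state, each write chains token t's value to token t-1's representation, which is the local-continuity prior discussed below.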

// ANALYSIS

The interesting part is not just the parameter cut; it is that the inductive bias seems to fit linear-attention-style memory better than softmax attention. In other words, GDN may be getting more mileage from state geometry and token-to-token chaining than from learned query/key transforms.

  • The reported gain is small but meaningful: better or equal loss with fewer parameters and faster convergence suggests Q/K projections were partly redundant at this scale.
  • The key clue is that the effect does not transfer to softmax attention; exact-key lookup and normalized retrieval there likely depend more on an explicit query/key separation.
  • The shifted-key setup bakes in a strong local continuity prior, which may help linear memory blocks because the hidden state already carries enough context to serve as both selector and address.
  • The results are still narrow: one dataset, undertrained runs, and a specific architecture family, so this is a useful architectural signal rather than proof of a general rule.
  • If it holds up at larger scales, it argues for simpler linear-attention blocks that spend capacity on memory dynamics instead of projection layers.
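The 12.5%-25% savings range is easy to reproduce with back-of-envelope arithmetic, assuming a layer built from equally sized d x d projection matrices; the matrix count of eight and the exact GDN layer layout are assumptions, not taken from the post.

```python
# Back-of-envelope for the reported parameter savings, assuming a layer
# of n_mats equally sized d x d projections (hypothetical layout).
d = 512
n_mats = 8                                # e.g. Q, K, V, output, gating, ...
layer_params = n_mats * d * d
drop_one = (1 * d * d) / layer_params     # drop K only   -> 12.5%
drop_two = (2 * d * d) / layer_params     # drop Q and K  -> 25.0%
print(drop_one, drop_two)                 # -> 0.125 0.25
```

With fewer or unequally sized matrices per layer the fractions shift, which is consistent with the post reporting a range rather than a single number.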
// TAGS
shifted-key-gated-delta-net · gated-delta-net · llm · benchmark · research · open-source

DISCOVERED

2026-04-04 (8d ago)

PUBLISHED

2026-04-04 (8d ago)

RELEVANCE

8 / 10

AUTHOR

jfguan