LM head bottleneck could throttle LLM training
This new arXiv paper argues that the language-model output head is not only an expressivity limit but also an optimization bottleneck during training. The authors report that 95-99% of the gradient norm is suppressed at the output layer, and show in controlled pretraining runs that this can slow convergence and make even simple patterns harder to learn as vocabulary size increases.
If this finding holds across large-scale runs, redesigning the LM head could become one of the highest-leverage ways to cut LLM training waste.
- The work reframes the classic softmax bottleneck as a gradient-flow problem, not just an output-capacity issue.
- The reported signal loss suggests current training pipelines may be spending compute on weak or noisy update directions.
- Because the bottleneck sits at the final projection, improvements here could benefit many transformer families without changing core architectures.
- It is still an early preprint (March 2026), so broader replication will be key before treating the findings as settled.
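To make the gradient-flow framing concrete, here is a toy sketch, not the paper's measurement or code. It assumes a random Gaussian output head `W` of shape `vocab_size x hidden_dim` and a single cross-entropy gradient with respect to the logits, then measures how much of that gradient's norm lies in the column space of `W`, which is the only part that can reach the hidden state through backpropagation. The function names, sizes, and the random setup are all hypothetical choices made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def orthonormal_basis(columns):
    """Modified Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        n = norm(v)
        if n > 1e-10:  # skip numerically dependent columns
            basis.append([vi / n for vi in v])
    return basis


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def retained_fraction(vocab_size, hidden_dim, seed=0):
    """Share of the logit-gradient norm that lies in the column space of a
    random head W (an assumption of this toy, not a trained model)."""
    rng = random.Random(seed)
    # Columns of the hypothetical head W, each a vector of length vocab_size.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
               for _ in range(hidden_dim)]
    basis = orthonormal_basis(columns)
    # Cross-entropy gradient w.r.t. logits: softmax(logits) - onehot(target).
    logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    grad = softmax(logits)
    grad[rng.randrange(vocab_size)] -= 1.0
    # Anything orthogonal to the column space vanishes in W^T @ grad.
    projected_sq = sum(dot(q, grad) ** 2 for q in basis)
    return math.sqrt(projected_sq) / norm(grad)


if __name__ == "__main__":
    for vocab_size in (32, 128, 512):
        frac = retained_fraction(vocab_size, hidden_dim=16)
        print(f"V={vocab_size:4d}  retained norm fraction = {frac:.3f}")
```

In this toy setup the retained fraction shrinks as the vocabulary grows relative to the hidden width, which matches the direction of the claim summarized above. The specific numbers depend on the random setup and should not be read as reproducing the paper's 95-99% figure.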
DISCOVERED
2026-03-14
PUBLISHED
2026-03-13
AUTHOR
141_1337