Gram Newton-Schulz accelerates Muon optimizer
Gram Newton-Schulz is a hardware-aware optimization of the standard Newton-Schulz algorithm, designed to remove the orthogonalization bottleneck in the Muon optimizer. By shifting computations to small symmetric Gram matrices and leveraging custom GPU kernels, it delivers 40–50% faster orthogonalization and significant FLOP savings without sacrificing model quality.
For large-scale LLM training, running the iteration on the Gram matrix instead of the full rectangular matrix reduces FLOPs by up to 58%. Custom CUDA kernels for NVIDIA Hopper and Blackwell architectures exploit the symmetry of the Gram matrix and are reported to run 2x faster than standard cuBLAS routines. Periodic restarts mitigate the numerical instability that the Gram formulation introduces, so the method is presented as a stable drop-in replacement for the Newton-Schulz step that bottlenecks Muon.
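The idea of iterating on the Gram matrix can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `restart_every` parameter, and the classic cubic coefficients (1.5, -0.5) are assumptions chosen for illustration, and the coefficients Muon actually uses may differ. The sketch assumes the input has no more rows than columns, so that `X @ X.T` is the small square matrix. It relies on the identity that if X_next = P X with P = aI + bA and A = X Xᵀ, then the next Gram matrix is P A P, so the loop only has to touch small matrices until the final multiply.

```python
import numpy as np

def newton_schulz(X, steps=5, a=1.5, b=-0.5):
    """Reference cubic Newton-Schulz: X <- a*X + b*(X X^T) X."""
    X = X / np.linalg.norm(X)           # Frobenius norm, so singular values <= 1
    I = np.eye(X.shape[0])
    for _ in range(steps):
        A = X @ X.T                     # Gram matrix, recomputed every step
        X = (a * I + b * A) @ X         # multiply against the large matrix
    return X

def gram_newton_schulz(X, steps=5, restart_every=3, a=1.5, b=-0.5):
    """Same iteration, carried out on the small Gram matrix (illustrative)."""
    X = X / np.linalg.norm(X)
    I = np.eye(X.shape[0])
    A = X @ X.T                         # formed once per restart
    Q = I.copy()                        # accumulated polynomial, X_k = Q @ X
    for k in range(1, steps + 1):
        P = a * I + b * A
        Q = P @ Q                       # small square products only
        A = P @ A @ P                   # Gram matrix of the next iterate
        if k % restart_every == 0 and k < steps:
            # Hypothetical restart rule: apply Q, then rebuild the Gram matrix
            # from the actual iterate to limit accumulated rounding error.
            X = Q @ X
            A = X @ X.T
            Q = I.copy()
    return Q @ X                        # one final large multiply

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 64))
print(np.allclose(newton_schulz(G), gram_newton_schulz(G), atol=1e-8))
```

In exact arithmetic the two functions return the same matrix; the difference lies in where the work happens. The reference version multiplies against the full rectangular matrix at every step, while the Gram version does so only at restarts and at the end.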
DISCOVERED
2026-03-31
PUBLISHED
2026-03-31
AUTHOR
Benlus