Gram Newton-Schulz accelerates Muon optimizer
REDDIT · 11d ago · RESEARCH PAPER

Gram Newton-Schulz is a hardware-aware optimization of the standard Newton-Schulz algorithm, designed to remove the orthogonalization bottleneck in the Muon optimizer. By shifting computations to small symmetric Gram matrices and leveraging custom GPU kernels, it delivers 40–50% faster orthogonalization and significant FLOP savings without sacrificing model quality.

// ANALYSIS

This implementation offers a "free lunch" performance boost for large-scale LLM training. For an m × n weight matrix with m ≤ n, the iteration runs on the small, symmetric m × m Gram matrix rather than on the full rectangular matrix, reducing FLOPs by up to 58%. Custom CUDA kernels optimized for NVIDIA Hopper and Blackwell architectures exploit that symmetry to deliver 2x speedups over standard cuBLAS routines. Periodic restarts mitigate the numerical instability of the Gram formulation, making it a stable drop-in replacement for the orthogonalization step that bottlenecks the Muon optimizer.
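
The Gram-matrix idea can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the classic cubic Newton-Schulz step (coefficients 1.5 and -0.5) rather than Muon's tuned polynomial, it assumes the input has no more rows than columns, and the function names and the `restart_every` parameter are hypothetical. It runs in float64 on the CPU, so it shows only the algebra, not the custom kernels or low-precision behavior.

```python
import numpy as np

def newton_schulz(G, steps=10):
    """Baseline: cubic Newton-Schulz applied to the full m x n matrix."""
    X = G / np.linalg.norm(G)  # Frobenius norm, so all singular values are <= 1
    for _ in range(steps):
        # Two products involving the rectangular matrix on every step.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def gram_newton_schulz(G, steps=10, restart_every=5):
    """Same iteration, tracked on m x m matrices (assumes m <= n).

    Invariant: the current iterate equals Q @ X, and R is its Gram matrix.
    """
    X = G / np.linalg.norm(G)
    m = X.shape[0]
    I = np.eye(m)
    Q = I.copy()   # accumulated m x m transform
    R = X @ X.T    # m x m symmetric Gram matrix
    for k in range(1, steps + 1):
        Z = 1.5 * I - 0.5 * R   # per-step polynomial in the Gram matrix
        Q = Z @ Q               # small square product
        R = Z @ R @ Z           # Gram matrix of the next iterate (Z is symmetric)
        if k % restart_every == 0 and k < steps:
            # Restart: materialize the iterate and rebuild R from it,
            # so rounding error in the tracked R does not keep accumulating.
            X = Q @ X
            R = X @ X.T
            Q = I.copy()
    return Q @ X   # single rectangular product at the end

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((8, 32))
    A = newton_schulz(G, steps=15)
    B = gram_newton_schulz(G, steps=15, restart_every=5)
    print("max abs difference:", np.abs(A - B).max())
    print("orthogonality error:", np.abs(B @ B.T - np.eye(8)).max())
```

In this sketch the baseline touches the rectangular matrix on every step, while the Gram variant touches it only when forming the initial Gram matrix, at each restart, and in the final product. Everything else is m × m work, which is where the savings come from when m is much smaller than n.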

// TAGS
llm · optimizer · gpu · research · muon · gram-newton-schulz · mlops

DISCOVERED: 2026-03-31 (11d ago)
PUBLISHED: 2026-03-31 (11d ago)
RELEVANCE: 9/10
AUTHOR: Benlus