Muon sticks with transformers, skips ConvNets
OPEN_SOURCE · REDDIT · 12d ago · NEWS

Muon is the open-source optimizer for hidden layers that gained traction in LLM training, but its public usage still clusters around transformer-shaped weights. The Reddit thread asks why a CIFAR-10 speed record hasn’t translated into broad ConvNet adoption.

// ANALYSIS

Muon’s transformer-first footprint is less mysterious than it looks: it is explicitly designed for 2D hidden-layer weights, while embeddings, biases, and other non-matrix parameters stay on AdamW. The bigger issue is not whether it can help vision models, but whether the gains are strong, repeatable, and worth the tuning cost outside LLMs.
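A minimal sketch of that split, assuming hypothetical parameter names and shapes (this is illustrative routing logic, not code from the Muon repository):

```python
# Hypothetical sketch of a Muon + AdamW hybrid: route matrix-shaped
# hidden-layer weights to Muon, everything else to AdamW.
# Parameter names and shapes below are invented for illustration.
def split_params(named_shapes):
    """Return (muon_params, adamw_params) name lists."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        # Muon is defined only for 2D hidden weights; embeddings,
        # biases, norms, and output heads stay on AdamW.
        if len(shape) == 2 and "embed" not in name and "head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = {
    "embed.weight": (50304, 768),   # embedding table -> AdamW
    "mlp.fc1.weight": (3072, 768),  # hidden matrix   -> Muon
    "mlp.fc1.bias": (3072,),        # 1D bias         -> AdamW
    "head.weight": (50304, 768),    # output head     -> AdamW
}
muon, adamw = split_params(params)
print(muon)   # ['mlp.fc1.weight']
print(adamw)  # ['embed.weight', 'mlp.fc1.bias', 'head.weight']
```

In a real training setup the two name lists would become separate optimizer parameter groups; the point is only that the split falls out of parameter shape, not model family.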

  • The official docs describe Muon as a hidden-layer optimizer and say ConvNets should use it only on convolutional filters, not as a blanket replacement.
  • Transformers are the cleanest match because most of the expensive parameters are matrix-shaped, so the orthogonalized-update idea maps naturally onto the model.
  • The CIFAR-10 speed record proves Muon can help on vision benchmarks, but one fast training run is not the same as broad evidence across modern CNNs or ViTs.
  • LLM training gets the attention because the compute budgets are massive, so even small optimizer gains matter; in smaller vision workloads, the ROI is harder to justify.
  • If Muon expands beyond transformers, it will likely be as part of hybrid optimizer stacks rather than a universal AdamW substitute.
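The "orthogonalized update" and "conv filters only" points above can be sketched together: Muon replaces a momentum matrix with an approximately orthogonalized version via a Newton–Schulz iteration, and a 4D conv kernel has to be flattened to 2D before that applies. The coefficients below follow those reported for the public Muon implementation's quintic iteration; treat the whole thing as a hedged sketch, not the reference code.

```python
import numpy as np

# Sketch of Muon's core step: approximately orthogonalize a gradient/
# momentum matrix with a Newton-Schulz iteration. The quintic
# coefficients match those reported for the open-source Muon
# implementation; everything else here is illustrative.
def newton_schulz_orthogonalize(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so iteration is stable
    transposed = G.shape[0] > G.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# A 4D conv filter (out, in, kh, kw) must be flattened to a 2D matrix
# first -- the docs' "apply Muon only to conv filters" caveat.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3, 3, 3))   # hypothetical conv kernel
G2d = W.reshape(W.shape[0], -1)         # (8, 27) matrix
O = newton_schulz_orthogonalize(G2d)
s = np.linalg.svd(O, compute_uv=False)
print(s.min(), s.max())  # singular values pulled toward ~1
```

The iteration does not land exactly on an orthogonal matrix; its singular values oscillate in a band around 1, which is reportedly good enough for the optimizer's purposes and cheap to compute in low precision.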
// TAGS
muon · llm · research · benchmark · open-source

DISCOVERED

2026-03-31 (12d ago)

PUBLISHED

2026-03-31 (12d ago)

RELEVANCE

8/10

AUTHOR

lukeiy