Muon sticks with transformers, skips ConvNets
OPEN_SOURCE · REDDIT · 12d ago · NEWS

Muon is the open-source optimizer for hidden layers that gained traction in LLM training, but its public usage still clusters around transformer-shaped weights. The Reddit thread asks why a CIFAR-10 speed record hasn’t translated into broad ConvNet adoption.

// ANALYSIS

Muon’s transformer-first footprint is less mysterious than it looks: it is explicitly designed for 2D hidden-layer weights, while embeddings, biases, and other non-matrix parameters stay on AdamW. The bigger issue is not whether it can help vision models, but whether the gains are strong, repeatable, and worth the tuning cost outside LLMs.
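A minimal sketch of that split, assuming hypothetical parameter names and shapes (this is illustrative routing logic, not code from the Muon repository):

```python
# Hypothetical sketch of a Muon + AdamW hybrid: route matrix-shaped
# hidden-layer weights to Muon, everything else to AdamW.
# Parameter names and shapes below are invented for illustration.
def split_params(named_shapes):
    """Return (muon_params, adamw_params) name lists."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        # Muon is defined only for 2D hidden weights; embeddings,
        # biases, norms, and output heads stay on AdamW.
        if len(shape) == 2 and "embed" not in name and "head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = {
    "embed.weight": (50304, 768),   # embedding table -> AdamW
    "mlp.fc1.weight": (3072, 768),  # hidden matrix   -> Muon
    "mlp.fc1.bias": (3072,),        # 1D bias         -> AdamW
    "head.weight": (50304, 768),    # output head     -> AdamW
}
muon, adamw = split_params(params)
print(muon)   # ['mlp.fc1.weight']
print(adamw)  # ['embed.weight', 'mlp.fc1.bias', 'head.weight']
```

In a real training setup the two name lists would become separate optimizer parameter groups; the point is only that the split falls out of parameter shape, not model family.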

  • The official docs describe Muon as a hidden-layer optimizer and say ConvNets should use it only on convolutional filters, not as a blanket replacement.
  • Transformers are the cleanest match because most of the expensive parameters are matrix-shaped, so the orthogonalized-update idea maps naturally onto the model.
  • The CIFAR-10 speed record proves Muon can help on vision benchmarks, but one fast training run is not the same as broad evidence across modern CNNs or ViTs.
  • LLM training gets the attention because the compute budgets are massive, so even small optimizer gains matter; in smaller vision workloads, the ROI is harder to justify.
  • If Muon expands beyond transformers, it will likely be as part of hybrid optimizer stacks rather than a universal AdamW substitute.
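The "orthogonalized update" and "conv filters only" points above can be sketched together: Muon replaces a momentum matrix with an approximately orthogonalized version via a Newton–Schulz iteration, and a 4D conv kernel has to be flattened to 2D before that applies. The coefficients below follow those reported for the public Muon implementation's quintic iteration; treat the whole thing as a hedged sketch, not the reference code.

```python
import numpy as np

# Sketch of Muon's core step: approximately orthogonalize a gradient/
# momentum matrix with a Newton-Schulz iteration. The quintic
# coefficients match those reported for the open-source Muon
# implementation; everything else here is illustrative.
def newton_schulz_orthogonalize(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so iteration is stable
    transposed = G.shape[0] > G.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# A 4D conv filter (out, in, kh, kw) must be flattened to a 2D matrix
# first -- the docs' "apply Muon only to conv filters" caveat.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3, 3, 3))   # hypothetical conv kernel
G2d = W.reshape(W.shape[0], -1)         # (8, 27) matrix
O = newton_schulz_orthogonalize(G2d)
s = np.linalg.svd(O, compute_uv=False)
print(s.min(), s.max())  # singular values pulled toward ~1
```

The iteration does not land exactly on an orthogonal matrix; its singular values oscillate in a band around 1, which is reportedly good enough for the optimizer's purposes and cheap to compute in low precision.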
// TAGS
muon · llm · research · benchmark · open-source

DISCOVERED

2026-03-31 (12d ago)

PUBLISHED

2026-03-31 (12d ago)

RELEVANCE

8/10

AUTHOR

lukeiy