Muon optimizer's 2× speedup traced to spectral normalization of loss curvature

New arXiv preprint explains why Muon trains large language models twice as fast as Adam by equalizing update directions across the loss landscape.

By Alex Sokoloff · June 13, 2026

Muon is an optimizer that trains large language models roughly twice as fast as Adam, and a preprint posted this week on arXiv explains the math behind the speedup. The paper, "Why Muon Outperforms Adam: A Curvature Perspective" by Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, and Zhuoran Yang, argues that Muon's advantage comes from spectral normalization, a matrix operation that rescales each parameter update so that no single direction dominates it. Adam, by contrast, can overweight certain directions, slowing convergence on complex loss surfaces.

The authors frame the result as a geometric explanation rather than an empirical one. By analyzing how data structure and model architecture shape local curvature, the paper offers a mathematical foundation for designing faster training algorithms. Spectral normalization ensures that all key directions in the parameter update matrix carry equal weight, preventing any single dimension from hijacking the optimization path. On the high-dimensional loss surfaces typical of billion-parameter models, this equalization translates directly into faster convergence.
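The equalization described above can be illustrated with a minimal NumPy sketch. It assumes an exact singular value decomposition, which may differ from how a production Muon implementation computes the normalized update. The function name `spectrally_normalize` and the example singular values are hypothetical and not taken from the paper.

```python
import numpy as np


def spectrally_normalize(update: np.ndarray) -> np.ndarray:
    """Return a matrix with the same singular vectors as `update`
    but with every singular value set to 1 (illustrative helper)."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    # Dropping the singular values gives every direction equal weight.
    return u @ vt


rng = np.random.default_rng(0)

# Build a hypothetical raw update whose singular values are very uneven:
# one direction is 10,000 times stronger than the weakest one.
q_left, _ = np.linalg.qr(rng.standard_normal((4, 3)))   # orthonormal columns
q_right, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # orthogonal matrix
raw = q_left @ np.diag([100.0, 1.0, 0.01]) @ q_right.T

normalized = spectrally_normalize(raw)

raw_sv = np.linalg.svd(raw, compute_uv=False)
norm_sv = np.linalg.svd(normalized, compute_uv=False)

print("raw singular values:       ", raw_sv)
print("normalized singular values:", norm_sv)
```

In this toy case the raw update is dominated by one direction, while the normalized update spreads its weight evenly across all three, which is the property the article attributes to Muon.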

Muon itself was introduced in late 2024 and has been adopted by several community projects, though the original release did not include a rigorous curvature analysis. This preprint fills that gap, moving the optimizer from empirical observation to theoretical grounding. For practitioners, understanding how optimizer behavior interacts with model geometry could inform choices about layer width, attention head count, and learning rate schedules. The paper was posted June 12, 2026, as arXiv:2606.04662.
