Muon optimizer's 2× speedup traced to spectral normalization of loss curvature
New arXiv preprint explains why Muon trains large language models twice as fast as Adam by equalizing update directions across the loss landscape.

Muon is an optimizer that trains large language models roughly twice as fast as Adam, and a preprint posted this week on arXiv explains the math behind the speedup. The paper, "Why Muon Outperforms Adam: A Curvature Perspective" by Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, and Zhuoran Yang, argues that Muon's advantage comes from spectral normalization, a matrix operation that rescales each parameter update so that no single direction dominates it. Adam, by contrast, can overweight certain directions, which slows convergence on complex loss surfaces.
The authors frame the result as a geometric explanation rather than an empirical one. By analyzing how data structure and model architecture shape local curvature, the paper offers a mathematical foundation for designing faster training algorithms. Spectral normalization ensures that all key directions in the parameter update matrix carry equal weight, preventing any single direction from hijacking the optimization path. On the high-dimensional loss surfaces typical of billion-parameter models, the paper argues, this equalization translates directly into faster convergence.
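To make "equal weight" concrete, here is a minimal sketch in plain Python. It is not taken from the paper. It uses the classic cubic Newton-Schulz iteration to push every singular value of a matrix toward 1, while public Muon implementations use a tuned variant of the same idea. The helper names (`matmul`, `transpose`, `orthogonalize`) and the toy `grad` matrix are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: classic cubic Newton-Schulz iteration,
# not the tuned variant found in Muon implementations.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate U V^T for g = U S V^T, i.e. set all singular values to 1.

    Assumes g has full row rank and no more rows than columns.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1].
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X moves each singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxt_x)]
    return x

# Hypothetical 2x3 "gradient": one direction is 100 times larger than the other.
grad = [[10.0, 0.0, 0.0],
        [0.0, 0.1, 0.0]]
update = orthogonalize(grad)

# X X^T equals the identity exactly when every singular value is 1.
gram = matmul(update, transpose(update))
```

In this toy case the raw matrix has singular values 10 and 0.1, so a plain gradient step would move almost entirely along the first direction. After the iteration both directions have magnitude close to 1, which is the equalization the article describes.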
Muon itself was introduced in late 2024 and has been adopted by several community projects, though the original release did not include a rigorous curvature analysis. This preprint fills that gap, moving the optimizer from empirical observation to theoretical grounding. For practitioners, understanding how optimizer behavior interacts with model geometry could inform choices about layer width, attention head count, and learning rate schedules. The paper was posted June 12, 2026, as arXiv:2606.04662.
