Muon optimizer's final layers need double the Newton-Schulz steps, study finds
Spectral analysis of Muon momentum buffers across 77M–2.8B parameter models finds that final layers need 10-step orthonormalization while early layers stay stable on 5 steps, a split that protects large-scale pretraining from collapse.

A new preprint reports that the Muon optimizer, already known for doubling compute efficiency over AdamW in large language model pretraining, can be tuned further by setting Newton-Schulz iteration counts per layer. The researchers tracked singular-value quantiles of momentum buffers across depths in models ranging from 77 million to 2.8 billion parameters and found that the singular values follow power laws, appearing as straight lines on a log-log scale, with exponents that vary sharply by layer. In early and middle layers the singular values scale slowly, and the standard 5-step orthonormalization stays stable. In the final layers they scale aggressively, and orthonormalization risks failing at state-of-the-art scales unless those layers get 10 steps.
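
To illustrate what a power law on a log-log scale means in practice, here is a minimal NumPy sketch. It is not the authors' code. The helper names `singular_value_quantiles` and `fit_power_law` are hypothetical, and the sketch assumes the quantity being fit is a singular-value quantile tracked against some scale variable `x` (for example, parameter count), which the summary above does not spell out.

```python
import numpy as np

def singular_value_quantiles(buffer, qs):
    """Quantiles of the singular values of a 2-D momentum buffer."""
    s = np.linalg.svd(buffer, compute_uv=False)
    return np.quantile(s, qs)

def fit_power_law(x, y):
    """Fit y ~ prefactor * x**exponent by least squares in log-log space.

    A power law is a straight line in log-log coordinates, so the slope
    of the fitted line is the exponent.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope, np.exp(intercept)

if __name__ == "__main__":
    # Synthetic data only, generated from a known exponent for illustration.
    x = np.array([1e8, 3e8, 1e9, 3e9])
    y = 3.0 * x ** 0.25
    exponent, prefactor = fit_power_law(x, y)
    print(f"exponent={exponent:.3f}, prefactor={prefactor:.3f}")
```

Fitting one exponent per layer and comparing them across depth is one way to reproduce the kind of layer-by-layer comparison the paper describes.
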
The practical recipe is to keep the cheap 5-step scheme on most layers and apply the costlier 10-step iteration only to the deepest layers, which preserves throughput while protecting quality. Muon adoption is climbing in state-of-the-art architectures, yet the standard configuration applies the same iteration count to every layer, a choice the paper calls "highly suboptimal." The spectral scaling laws offer the first systematic look at Muon's momentum dynamics during pretraining, and the layer-specific tuning is backed by theory rather than grid search. Code is available at github.com/KellerJordan/modded-nanogpt. The full paper, "Spectral Scaling Laws of Muon" by Gagik Magakyan, Pablo Parrilo, and Asuman Ozdaglar, was posted to arXiv on June 11, 2026.
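
The per-layer recipe can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation. Muon implementations typically use a tuned higher-order polynomial, and the textbook cubic Newton-Schulz iteration is swapped in here for simplicity. The function names and the choice of `n_final=2` final layers are hypothetical, since the summary does not say how many layers count as "final."

```python
import numpy as np

def newton_schulz(G, steps):
    """Approximately orthonormalize G with the classic cubic iteration
    X <- 1.5 X - 0.5 (X X^T) X (a stand-in, not Muon's tuned polynomial)."""
    # Dividing by the Frobenius norm bounds every singular value by 1.
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix X @ X.T as small as possible
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

def steps_for_layer(layer_idx, n_layers, n_final=2, base_steps=5, final_steps=10):
    """Hypothetical schedule: 10 steps on the last n_final layers, 5 elsewhere."""
    return final_steps if layer_idx >= n_layers - n_final else base_steps

def orthonormality_error(X):
    """Frobenius distance between the smaller Gram matrix of X and the identity."""
    gram = X @ X.T if X.shape[0] <= X.shape[1] else X.T @ X
    return np.linalg.norm(gram - np.eye(gram.shape[0]))

def power_law_matrix(rows, cols, alpha, seed=0):
    """Synthetic matrix whose i-th singular value is i**(-alpha)."""
    rng = np.random.default_rng(seed)
    k = min(rows, cols)
    U, _ = np.linalg.qr(rng.standard_normal((rows, k)))
    V, _ = np.linalg.qr(rng.standard_normal((cols, k)))
    s = np.arange(1, k + 1, dtype=float) ** (-alpha)
    return (U * s) @ V.T

if __name__ == "__main__":
    n_layers = 12
    for layer in (0, n_layers - 1):
        G = power_law_matrix(32, 64, alpha=1.5, seed=layer)
        steps = steps_for_layer(layer, n_layers)
        err = orthonormality_error(newton_schulz(G, steps))
        print(f"layer {layer}: steps={steps}, error={err:.3f}")
```

On a spectrum that decays quickly, the smallest singular values are the slowest to reach 1, so extra iterations shrink the orthonormality error. That is the intuition behind spending the additional steps only where the spectrum demands them.
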






