Muon optimizer's final layers need double the Newton-Schulz steps, study finds

Spectral analysis of Muon momentum buffers across 77M–2.8B parameter models reveals final layers require 10-step orthonormalization while early layers stay stable on 5 steps, protecting large-scale pretraining from collapse.

By Alex Sokoloff · June 12, 2026


A new preprint reports that the Muon optimizer, already known for doubling compute efficiency over AdamW in large language model pretraining, can be tuned further by setting the Newton-Schulz iteration count per layer. The researchers tracked singular-value quantiles of the momentum buffers at each depth in models ranging from 77 million to 2.8 billion parameters. They found that the singular values follow power laws, appearing as straight lines on a log-log scale, with exponents that vary sharply by layer. Early and middle layers scale slowly and remain stable with the standard 5-step orthonormalization. Final layers scale aggressively and risk orthonormalization failure at state-of-the-art scales unless given 10 steps.

The practical recipe: keep the cheap 5-step scheme on most layers and apply the costlier 10-step iteration only to the final layers, preserving throughput while protecting quality. Muon adoption is climbing in state-of-the-art architectures, yet the standard configuration applies the same iteration count to every layer, a choice the paper calls "highly suboptimal." The spectral scaling laws are the first systematic look at Muon's momentum dynamics during pretraining, and the layer-specific tuning is backed by theory rather than grid search. Code is available at github.com/KellerJordan/modded-nanogpt, and the full paper, "Spectral Scaling Laws of Muon" by Gagik Magakyan, Pablo Parrilo, and Asuman Ozdaglar, was posted to arXiv on June 11, 2026.
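
To make the recipe concrete, here is a minimal NumPy sketch of a per-layer step schedule. It is not the paper's code. The quintic coefficients are the ones found in the publicly circulated Muon reference implementation, and whether the paper uses the same ones is an assumption. The helper `steps_for_layer` and its `num_final` cutoff are hypothetical, since the article does not say how many layers count as "final". The toy momentum matrix is chosen only to have a wide singular-value spread.

```python
import numpy as np

# Quintic Newton-Schulz coefficients from the public Muon reference code
# (assumed here; the paper may use different ones).
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def newton_schulz(G, steps):
    """Approximately orthonormalize G with `steps` quintic Newton-Schulz iterations."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1].
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = B_COEF * A + C_COEF * (A @ A)
        X = A_COEF * X + B @ X
    return X.T if transposed else X


def steps_for_layer(layer_idx, num_layers, num_final=2, base_steps=5, final_steps=10):
    """Hypothetical schedule: the last `num_final` layers get the longer iteration."""
    return final_steps if layer_idx >= num_layers - num_final else base_steps


# Toy "momentum buffer" whose singular values differ by a factor of 1000.
momentum = np.diag([1.0, 1e-3])
for steps in (5, 10):
    sv = np.linalg.svd(newton_schulz(momentum, steps), compute_uv=False)
    print(f"{steps} steps: min sv = {sv.min():.3f}, max sv = {sv.max():.3f}")

# Step counts a 12-layer model would receive under this schedule.
print([steps_for_layer(i, 12) for i in range(12)])
```

In this toy case the smallest singular value is still well short of 1 after 5 steps, while 10 steps bring it into the same band as the largest one. That is the failure mode the article describes for layers whose spectra spread quickly. Note that these coefficients drive singular values into a band around 1 rather than to exactly 1.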
