Normalized momentum SGD converges under quadratic noise without bounded domains

New arXiv preprint proves normalized stochastic gradient descent with momentum reaches stationary points under Blum-Gladyshev noise—where gradient variance scales quadratically with distance from initialization—without bounded domains or growing batch sizes.

May 18, 2026

A preprint posted to arXiv on May 18 shows that normalized stochastic gradient descent with momentum can converge in nonconvex optimization even when noise variance grows quadratically with distance from the starting point. The paper proves an O(ε⁻⁶) oracle complexity bound under the Blum-Gladyshev (BG-0) noise model using only a single stochastic gradient per iteration. The result holds for both standard smooth objectives and α-symmetric generalized-smooth functions, where local curvature scales with the gradient norm.
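
The update described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the preprint's algorithm or tuning. The function name `normalized_momentum_sgd`, the parameters `lr`, `beta`, `sigma0`, and `sigma1`, the Gaussian noise, and the variance form sigma0^2 + sigma1^2 * ||x - x0||^2 are all hypothetical choices made for this example. The preprint's exact BG-0 condition and step-size schedule may differ.

```python
import numpy as np

def normalized_momentum_sgd(grad, x0, steps, lr=0.01, beta=0.9,
                            sigma0=0.1, sigma1=0.1, seed=0):
    """Normalized SGD with momentum, one noisy gradient per iteration.

    Assumed noise model (illustrative only):
        E||noise||^2 = sigma0^2 + sigma1^2 * ||x - x0||^2
    """
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    dim = x0.size
    x = x0.copy()
    m = np.zeros(dim)              # momentum buffer
    xs = [x.copy()]                # trajectory, including the start
    for _ in range(steps):
        # Noise standard deviation grows with distance from the start.
        scale = np.sqrt(sigma0**2 + sigma1**2 * np.sum((x - x0) ** 2))
        g = grad(x) + scale * rng.standard_normal(dim) / np.sqrt(dim)
        # Exponential moving average of stochastic gradients.
        m = beta * m + (1.0 - beta) * g
        # Normalized step: only the direction of m is used, so every
        # step has length lr no matter how large the noise is.
        x = x - lr * m / max(np.linalg.norm(m), 1e-12)
        xs.append(x.copy())
    return np.array(xs)

if __name__ == "__main__":
    # Toy nonconvex objective f(x) = sum(x_i^2 / (1 + x_i^2)).
    toy_grad = lambda x: 2.0 * x / (1.0 + x**2) ** 2
    path = normalized_momentum_sgd(toy_grad, [1.5, -0.5], steps=500)
    print("final gradient norm:", np.linalg.norm(toy_grad(path[-1])))
```

Because the step is normalized, the iterate can move at most `lr` per iteration. That is one intuition for why a method of this kind can tolerate noise that grows with distance, although the sketch says nothing about the rates proved in the paper.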

The authors also introduce a variance-reduced normalized STORM method. Under mean-square smoothness with sharp initialization, the method achieves an O(ε⁻⁴) rate that matches the known lower bound and is therefore minimax-optimal. When generalized smoothness couples with distance-dependent noise, complexity degrades to O(ε⁻⁽⁴⁺ᵅ⁾) for α ∈ (0,1) and O(ε⁻⁵) for α = 1. If the distance-growth parameter vanishes, the guarantees recover the standard bounded-variance rates: O(ε⁻⁴) for the momentum method, O(ε⁻³) for the variance-reduced method, and O(ε⁻²) in the deterministic case.
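
A STORM-style estimator corrects the running gradient estimate by evaluating the same random sample at the current and the previous iterate. The Python sketch below combines that recursion with a normalized step. It is a hedged illustration rather than the authors' method. The name `normalized_storm`, the parameters `lr`, `a`, `sigma0`, and `sigma1`, and the two-component Gaussian oracle are assumptions made for this example. The estimator is initialized from a single sample, so the sharp-initialization condition mentioned above is not modeled.

```python
import numpy as np

def normalized_storm(grad, x0, steps, lr=0.01, a=0.1,
                     sigma0=0.1, sigma1=0.1, seed=0):
    """Normalized step with a STORM-style recursive gradient estimator.

    Assumed oracle (illustrative only): for a sample xi = (xi_a, xi_b),
        g(x; xi) = grad(x) + (sigma0*xi_a + sigma1*||x - x0||*xi_b) / sqrt(dim)
    so that E||g - grad||^2 = sigma0^2 + sigma1^2 * ||x - x0||^2.
    """
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    dim = x0.size

    def oracle(x, xi):
        xi_a, xi_b = xi
        dist = np.linalg.norm(x - x0)
        return grad(x) + (sigma0 * xi_a + sigma1 * dist * xi_b) / np.sqrt(dim)

    def sample():
        return rng.standard_normal(dim), rng.standard_normal(dim)

    x = x0.copy()
    d = oracle(x, sample())        # single-sample initial estimate
    xs = [x.copy()]
    for _ in range(steps):
        x_prev = x
        # Normalized step along the current estimate.
        x = x - lr * d / max(np.linalg.norm(d), 1e-12)
        # Reuse ONE fresh sample at both the new and the previous point.
        xi = sample()
        d = oracle(x, xi) + (1.0 - a) * (d - oracle(x_prev, xi))
        xs.append(x.copy())
    return np.array(xs)

if __name__ == "__main__":
    toy_grad = lambda x: 2.0 * x / (1.0 + x**2) ** 2
    path = normalized_storm(toy_grad, [1.5, -0.5], steps=500)
    print("final gradient norm:", np.linalg.norm(toy_grad(path[-1])))
```

Evaluating one sample at two nearby points is what lets the correction term cancel most of the noise when the oracle is smooth in a mean-square sense. With `sigma0 = sigma1 = 0` the estimator reduces to the exact gradient, which gives a quick sanity check.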

These are the first convergence guarantees for normalized methods in nonconvex stochastic optimization under BG-0 noise that do not require bounded domains, increasing batch sizes, or explicit anchoring. The analysis covers the standard and generalized smoothness regimes in a single framework. Empirical validation on large-scale neural network training is the natural next step. Two questions remain for follow-on work: whether the O(ε⁻⁶) guarantee for single-sample momentum translates into wall-clock speedups over adaptive methods like AdamW, and whether the variance-reduced STORM variant can reach O(ε⁻⁴) without the sharp-initialization assumption.
