Newton's method beats gradient descent on high-frequency data in overparameterized neural networks
A new convergence proof shows that regularized Newton training converges exponentially fast in infinite-width neural networks, with a kernel spectrum bounded away from zero that sidesteps the frequency bias plaguing gradient descent.
A convergence analysis published this week on arXiv demonstrates that regularized Newton's method drives the training loss of overparameterized neural networks to zero at an exponential rate, addressing the spectral bias that slows gradient descent on high-frequency target data. The paper introduces a "Newton neural tangent kernel" (NNTK) and proves that as the number of hidden units approaches infinity, the training dynamics converge in probability to a deterministic limit equation with explicit convergence rates.
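In symbols (a sketch consistent with the article's description; the paper's exact normalization and notation may differ), the update studied is a damped Newton step on the training loss $L$ with a width-dependent regularization parameter $\lambda_m$,

$$\theta_{k+1} = \theta_k - \left(\nabla^2 L(\theta_k) + \lambda_m I\right)^{-1} \nabla L(\theta_k),$$

and the infinite-width claim is that the network's outputs on the training data follow a deterministic recursion driven by the NNTK, under which the residual contracts geometrically, $\|f_k - y\| \le (1 - c)^k \|f_0 - y\|$ for some rate $c \in (0, 1]$ set by the NNTK's smallest eigenvalue and $\lambda_m$.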
The core advantage lies in the eigenvalue structure. The standard neural tangent kernel governing gradient descent accumulates eigenvalues near zero, starving high-frequency components of learning signal. The NNTK keeps its eigenvalues uniformly bounded away from zero when the regularization parameter is chosen correctly, so Newton updates converge at a uniform rate across the frequency spectrum. The authors derive a scaling rule under which the regularization parameter can vanish at a controlled rate as network width grows while the regularized Hessian stays positive definite throughout training, even though the raw Hessian may be indefinite.
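To see why the eigenvalue floor matters, here is a toy numerical illustration, not taken from the paper and with invented spectra: under kernel-driven gradient dynamics each eigenmode of the residual decays roughly like exp(-lambda_i * t), so modes sitting on near-zero eigenvalues barely move, while a spectrum bounded below by a constant drives every mode down at a uniform minimum rate.

```python
# Toy illustration (not the paper's code): per-mode residual decay under an
# NTK-like spectrum that collapses toward zero versus a spectrum with a
# uniform floor, as the article attributes to the NNTK. Both spectra are
# made up for the example.
import jax.numpy as jnp

modes = jnp.arange(1, 11)                      # mode index, low to high frequency
ntk_eigs = 1.0 / modes ** 2                    # eigenvalues collapsing toward zero
floor_eigs = jnp.full_like(ntk_eigs, 0.5)      # eigenvalues bounded away from zero

t = 10.0                                       # training time, arbitrary units
residual_ntk = jnp.exp(-ntk_eigs * t)          # residual ~ exp(-lambda_i * t) per mode
residual_floor = jnp.exp(-floor_eigs * t)

for i in range(modes.size):
    print(f"mode {i + 1:2d}: NTK-like residual {float(residual_ntk[i]):.3f}, "
          f"floored-spectrum residual {float(residual_floor[i]):.3f}")
```

The high-index modes under the decaying spectrum barely shrink over the same training time, which is the spectral bias the article describes.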
The analysis navigates two technical obstacles: implicit parameter updates with a potentially indefinite Hessian, and a linear system whose dimension explodes with network width. The proof shows that for sufficiently wide networks, individual parameter updates shrink toward zero and the model behaves as a linearization around initialization—meaning the infinite-width limit is tractable and the finite-width dynamics provably converge to it.
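A minimal sketch of the step the proof analyzes, assuming a toy two-layer ReLU network and a fixed illustrative damping value rather than the paper's width-dependent formula: form the raw Hessian of the loss, which may be indefinite, add lambda times the identity, and solve for the Newton direction. Printing the smallest eigenvalue of the damped Hessian and the largest single-parameter update at a couple of widths gives a feel for the two claims, positive definiteness after regularization and parameter updates that shrink as width grows.

```python
# Sketch of one regularized Newton step on a toy two-layer ReLU network.
# Assumptions: the network, data, and damping value lam are illustrative
# choices, not the paper's construction or its width-dependent scaling.
import jax
import jax.numpy as jnp

D_IN = 1

def init_params(key, width):
    k1, k2 = jax.random.split(key)
    w = jax.random.normal(k1, (width, D_IN)) / jnp.sqrt(D_IN)   # hidden layer
    v = jax.random.normal(k2, (width,)) / jnp.sqrt(width)       # output layer
    return jnp.concatenate([w.ravel(), v])                      # flat parameter vector

def forward(theta, x, width):
    w = theta[: width * D_IN].reshape(width, D_IN)
    v = theta[width * D_IN:]
    return jax.nn.relu(x @ w.T) @ v

def loss(theta, x, y, width):
    return 0.5 * jnp.mean((forward(theta, x, width) - y) ** 2)

key = jax.random.PRNGKey(0)
x = jnp.linspace(-1.0, 1.0, 32).reshape(-1, 1)
y = jnp.sin(8.0 * jnp.pi * x[:, 0])            # high-frequency target

lam = 1e-2                                     # illustrative damping
for width in (64, 256):
    theta0 = init_params(key, width)
    grad = jax.grad(loss)(theta0, x, y, width)
    hess = jax.hessian(loss)(theta0, x, y, width)      # raw Hessian, may be indefinite
    hess_reg = hess + lam * jnp.eye(theta0.size)       # regularized Hessian
    step = jnp.linalg.solve(hess_reg, grad)            # damped Newton direction
    print(f"width {width:4d}: min eig of H + lam*I = "
          f"{float(jnp.linalg.eigvalsh(hess_reg).min()):+.4f}, "
          f"max single-parameter update = {float(jnp.abs(step).max()):.4f}")
```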
The immediate question is whether the scaling formula translates to practical speedups on real tasks with high-frequency structure, and whether second-order methods can be made computationally viable at the widths where the theory guarantees convergence. Implementations that approximate the NNTK or use low-rank Hessian updates may close the gap between the infinite-width guarantee and deployable code.
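One plausible route, sketched below under the assumption that the damped Hessian is positive definite (exactly the condition the paper's regularization scaling targets), is to avoid materializing the width-sized linear system entirely and solve it matrix-free with Hessian-vector products and conjugate gradients. This is illustrative code, not an implementation described in the paper.

```python
# Matrix-free damped Newton step: never form the Hessian, only Hessian-vector
# products, and solve (H + lam*I) d = grad with conjugate gradients. Sketch
# only; assumes lam is large enough that H + lam*I is positive definite.
import jax
from jax.scipy.sparse.linalg import cg

def damped_newton_step(loss_fn, theta, lam, *batch):
    grad = jax.grad(loss_fn)(theta, *batch)

    def damped_hvp(v):
        # Forward-over-reverse Hessian-vector product plus the damping term.
        hv = jax.jvp(lambda t: jax.grad(loss_fn)(t, *batch), (theta,), (v,))[1]
        return hv + lam * v

    direction, _ = cg(damped_hvp, grad, maxiter=100)
    return theta - direction

# Example with the toy network from the previous sketch:
#   theta1 = damped_newton_step(loss, theta0, 1e-2, x, y, 256)
```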
