Neural network curvature exponent decomposed: why convolutions differ from transformers
New preprint proves that the curvature exponent α, which governs how Hessian eigenvalues scale with gradient singular values, decomposes into a baseline of 2 plus a geometric alignment term, explaining why convolutions show α≈2 while transformer attention and MLP layers fall lower.

A new preprint explains why neural network layers exhibit systematically different curvature exponents. The curvature exponent α is defined by the scaling h_k ∝ σ_k^α between Hessian eigenvalues h_k and gradient singular values σ_k, and it varies empirically by layer type: convolutional layers show α≈2, transformer attention layers show α≈1, and MLP up-projections fall below 1. Posted to arXiv on June 3, the paper derives an exact decomposition: α = 2 + d log Φ_k / d log σ_k, where Φ_k measures alignment between Kronecker factor eigenbases and gradient singular directions. The exponent therefore equals 2 exactly when the alignment Φ_k does not vary with σ_k, and any departure from 2 is the log-log slope of the alignment term.
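To make the decomposition concrete, here is a minimal sketch on synthetic data. It assumes the form h_k = σ_k² · Φ_k, which is what the formula above implies up to a constant, and it models Φ_k as a pure power of σ_k. The function names and all numerical values are illustrative and are not taken from the preprint.

```python
import math

def fit_loglog_slope(x, y):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

# Synthetic gradient singular values with a power-law decay (assumed).
sigma = [k ** -0.8 for k in range(1, 51)]

def alpha_for(phi_exponent):
    """Fit alpha when the alignment term is Phi_k = sigma_k ** phi_exponent.

    Assumption: h_k = sigma_k**2 * Phi_k, so the fitted slope should be
    2 + phi_exponent, matching alpha = 2 + d log Phi / d log sigma.
    """
    phi = [s ** phi_exponent for s in sigma]
    h = [s ** 2 * p for s, p in zip(sigma, phi)]
    return fit_loglog_slope(sigma, h)

alpha_flat = alpha_for(0.0)      # alignment independent of sigma
alpha_tilted = alpha_for(-1.0)   # alignment falls as sigma grows
```

In this toy setting a flat alignment term gives the convolution-like exponent of 2, while an alignment slope of -1 gives the attention-like exponent of 1. Real layers would need h_k estimated from Hessian-vector products rather than constructed by hand.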
This decomposition reframes "why does α vary?" as a geometric question about layer-specific structure. The authors analyze how LayerNorm, residual connections, and softmax heads affect spectral alignment, then derive a spectral transfer identity linking three quantities: the curvature exponent α, the effective gradient rank-decay exponent γ, and the Hessian decay exponent s. The identity s = αγ is algebraic, but its predictive power is empirical: fitting α and γ independently on different data (Hessian-vector products vs. singular value decomposition) recovers s to within about 2% median error across 93 layers, five architectures, and three datasets, with no free parameters. A zeta-function bound on the participation ratio shows that curvature concentrates onto effectively one direction per layer.
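The identity and the participation ratio can both be illustrated on a toy spectrum. The sketch below assumes pure power laws, σ_k ∝ k^(-γ) and h_k ∝ σ_k^α, and uses the standard participation ratio (Σ h_k)² / Σ h_k², whose limit for an infinite power-law spectrum is ζ(s)² / ζ(2s). These definitions, the helper names, and the values α = 2, γ = 1.5 are assumptions for illustration; the preprint's exact definitions and bound may differ.

```python
import math

def loglog_slope(x, y):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

def participation_ratio(h):
    """Effective number of directions carrying the curvature."""
    return sum(h) ** 2 / sum(v * v for v in h)

def zeta_partial(s, terms=100_000):
    """Truncated Riemann zeta sum, adequate for s well above 1."""
    return sum(k ** -s for k in range(1, terms + 1))

# Assumed synthetic layer: alpha = 2, gamma = 1.5, so s should be 3.
alpha_true, gamma_true = 2.0, 1.5
ranks = list(range(1, 201))
sigma = [k ** -gamma_true for k in ranks]
h = [s ** alpha_true for s in sigma]

gamma_fit = -loglog_slope(ranks, sigma)  # decay of singular values in rank
alpha_fit = loglog_slope(sigma, h)       # curvature exponent
s_fit = -loglog_slope(ranks, h)          # decay of Hessian eigenvalues in rank
s_pred = alpha_fit * gamma_fit           # spectral transfer identity

pr = participation_ratio(h)
pr_limit = zeta_partial(3.0) ** 2 / zeta_partial(6.0)
```

On noise-free power laws the identity holds to floating-point precision, so this shows only the algebra, not the roughly 2% empirical agreement reported for real layers. The participation ratio moves toward 1 as s grows, which is the sense in which steeply decaying curvature concentrates on a single direction.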
Preconditioner design
The authors apply the decomposition to optimization. They derive an architecture-adaptive preconditioner T(σ;α) and implement it in the gradient singular basis as Spectral Newton. On vision benchmarks where α≈2, Spectral Newton outperforms AdamW. The spectral transfer identity holds across ResNets, Vision Transformers, and other standard architectures tested on ImageNet, CIFAR-10, and CIFAR-100, suggesting the decomposition captures a fundamental property of how different layer types shape the loss landscape.
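The summary above does not give the form of T(σ;α), so the following is only a sketch of what preconditioning in the gradient singular basis could look like. It assumes a damped Newton-style rescaling T(σ;α) = 1 / (σ^α + damping), motivated by h_k ∝ σ_k^α. The function name, the damping term, and the random test matrix are hypothetical, and this is not the authors' Spectral Newton implementation.

```python
import numpy as np

def spectral_precondition(grad, alpha, damping=1e-3):
    """Rescale a gradient matrix in its own singular basis.

    Hypothetical form: T(sigma; alpha) = 1 / (sigma**alpha + damping),
    i.e. each singular component is divided by a curvature estimate
    modelled as sigma**alpha, with damping for numerical stability.
    """
    U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)
    T = 1.0 / (sigma ** alpha + damping)
    # Scale each singular direction u_k v_k^T by sigma_k * T(sigma_k).
    return (U * (sigma * T)) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))          # stand-in for a layer gradient
step = spectral_precondition(G, alpha=2.0)
```

With α = 2 each singular component is scaled roughly by 1/σ_k, so large-gradient directions are damped most strongly. Setting α = 0 with zero damping leaves the gradient unchanged, which gives a simple sanity check on the basis change.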



