
Neural network curvature exponent decomposed: why convolutions differ from transformers | UncensoredHub


A new preprint shows that the curvature exponent α, which governs how Hessian eigenvalues scale with gradient singular values, decomposes into a baseline of 2 plus a geometric alignment term. This explains why convolutions show α≈2 while transformer attention and MLP layers show smaller values.

By Alex Sokoloff · June 3, 2026


A new preprint explains why neural network layers exhibit systematically different curvature exponents, the quantity governing how Hessian eigenvalues scale with gradient singular values. The curvature exponent α is defined by the scaling relation h_k ∝ σ_k^α, where h_k is the Hessian eigenvalue paired with the k-th gradient singular value σ_k. Empirically, α varies by layer type: convolutional layers show α≈2, transformer attention layers show α≈1, and MLP up-projections fall below 1. Posted to arXiv on June 3, the paper derives an exact decomposition: α = 2 + d log Φ_k / d log σ_k, where Φ_k measures the alignment between Kronecker factor eigenbases and gradient singular directions. When the alignment does not vary with σ_k, the second term vanishes and α = 2.
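The decomposition can be illustrated numerically with a minimal sketch. It assumes, consistently with the formula above but not quoted from the paper, that h_k factors as Φ_k · σ_k^2, and it estimates α as the least-squares slope of log h_k against log σ_k. The function name `fit_curvature_exponent` and the synthetic spectra are hypothetical and are not taken from the preprint.

```python
import numpy as np

def fit_curvature_exponent(sigma, h):
    """Estimate alpha as the least-squares slope of log h against log sigma."""
    log_sigma = np.log(np.asarray(sigma, dtype=float))
    log_h = np.log(np.asarray(h, dtype=float))
    slope, _intercept = np.polyfit(log_sigma, log_h, deg=1)
    return slope

# Synthetic gradient singular values with power-law decay in the index k.
k = np.arange(1, 51)
sigma = k ** -0.8

# Case 1: alignment Phi_k is constant, so d log Phi / d log sigma = 0.
h_constant_alignment = 0.5 * sigma ** 2

# Case 2: alignment Phi_k = sigma_k^-1, so d log Phi / d log sigma = -1.
h_varying_alignment = (sigma ** -1.0) * sigma ** 2

alpha_constant = fit_curvature_exponent(sigma, h_constant_alignment)
alpha_varying = fit_curvature_exponent(sigma, h_varying_alignment)
print(f"constant alignment: alpha = {alpha_constant:.3f}")
print(f"varying alignment:  alpha = {alpha_varying:.3f}")
```

Under these assumptions the first case recovers the convolution-like value of 2, and the second recovers 2 + (−1) = 1, the value the article reports for attention layers. Real layers would need measured h_k and σ_k in place of the synthetic spectra.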

This decomposition reframes "why does α vary?" as a geometric question about layer-specific structures. The authors analyze how LayerNorm, residual connections, and softmax heads affect spectral alignment, then derive a spectral transfer identity linking three quantities: the curvature exponent α, the effective gradient rank-decay exponent γ, and the Hessian decay exponent s. The relationship s = αγ is algebraic, but its predictive power is empirical. Fitting α and γ independently on different data (Hessian-vector products for α, singular value decomposition for γ) recovers s with about 2% median error across 93 layers, five architectures, and three datasets, with no free parameters. A zeta-function bound on the participation ratio shows that curvature concentrates onto effectively one direction per layer.
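The identity can be sketched on synthetic data. The sketch assumes that γ is defined by σ_k ∝ k^(−γ) and s by h_k ∝ k^(−s), an interpretation the article does not spell out. Under it, h_k ∝ σ_k^α gives h_k ∝ k^(−αγ), so s = αγ. The helper names, noise level, and exponents below are hypothetical. The participation ratio is taken as (Σ h_k)² / Σ h_k², which for an infinite pure power law with s > 1 equals ζ(s)² / ζ(2s); the paper's actual bound is not reproduced here.

```python
import numpy as np

def loglog_slope(x, y):
    """Least-squares slope of log y against log x."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return slope

def participation_ratio(h):
    """Effective number of directions carrying the curvature."""
    h = np.asarray(h, dtype=float)
    return h.sum() ** 2 / np.square(h).sum()

rng = np.random.default_rng(0)
k = np.arange(1, 201)
alpha_true, gamma_true = 1.0, 0.8  # hypothetical values for illustration

sigma_true = k ** -gamma_true
h_true = sigma_true ** alpha_true

# Independent multiplicative noise stands in for separate measurements.
sigma_obs = sigma_true * np.exp(rng.normal(0.0, 0.05, k.size))
h_obs = h_true * np.exp(rng.normal(0.0, 0.05, k.size))

gamma_fit = -loglog_slope(k, sigma_obs)      # gradient rank decay
alpha_fit = loglog_slope(sigma_obs, h_obs)   # curvature exponent
s_direct = -loglog_slope(k, h_obs)           # Hessian decay, fitted directly
s_predicted = alpha_fit * gamma_fit          # spectral transfer identity
rel_error = abs(s_predicted - s_direct) / s_direct
print(f"s direct = {s_direct:.3f}, s predicted = {s_predicted:.3f}")

# Participation ratio of pure power-law spectra with s = 2 and s = 4.
index = np.arange(1, 10001, dtype=float)
pr_s2 = participation_ratio(index ** -2.0)
pr_s4 = participation_ratio(index ** -4.0)
print(f"participation ratio: s=2 -> {pr_s2:.3f}, s=4 -> {pr_s4:.3f}")
```

The faster the spectrum decays, the closer the participation ratio gets to 1, which is the sense in which curvature concentrates onto a single direction.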

Preconditioner design

The authors apply the decomposition to optimization. They derive an architecture-adaptive preconditioner T(σ;α) and implement it in the gradient singular basis as Spectral Newton. On vision benchmarks where α≈2, Spectral Newton outperforms AdamW. The spectral transfer identity holds across ResNets, Vision Transformers, and other standard architectures tested on ImageNet, CIFAR-10, and CIFAR-100, suggesting the decomposition captures a fundamental property of how different layer types shape the loss landscape.
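The article does not give the form of T(σ;α), so the following is only a sketch of what preconditioning in the gradient singular basis could look like. It assumes a damped Newton-style rescaling in which each singular direction is divided by a curvature proxy σ_k^α + λ. The function name `spectral_precondition`, the `damping` parameter, and this choice of T are all hypothetical and should not be read as the authors' Spectral Newton.

```python
import numpy as np

def spectral_precondition(grad, alpha, damping=1e-3):
    """Rescale a gradient matrix in its own singular basis.

    Assumed form (not from the paper): curvature along direction k is
    approximated by sigma_k**alpha + damping, and the step along that
    direction is sigma_k divided by this proxy.
    """
    U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)
    curvature_proxy = sigma ** alpha + damping
    step_singular_values = sigma / curvature_proxy
    # Scale the columns of U, then map back to parameter space.
    return (U * step_singular_values) @ Vt

rng = np.random.default_rng(0)
grad = rng.normal(size=(6, 4))
sigma = np.linalg.svd(grad, compute_uv=False)

# alpha = 1 with no damping equalizes all singular directions.
step_attention_like = spectral_precondition(grad, alpha=1.0, damping=0.0)

# alpha = 2 with no damping gives step singular values of 1 / sigma_k.
step_conv_like = spectral_precondition(grad, alpha=2.0, damping=0.0)

print(np.linalg.svd(step_attention_like, compute_uv=False))
print(np.linalg.svd(step_conv_like, compute_uv=False))
```

Under this assumed form, the same routine behaves differently per layer type: with α≈1 it orthogonalizes the update, while with α≈2 it boosts low-σ directions in proportion to 1/σ_k, which is why the damping term matters in practice.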

