New two-time-scale algorithm brings a central limit theorem to robust reinforcement learning
New preprint proposes Mean-Variance Stochastic Approximation for distributionally robust reinforcement learning, proving convergence and a central limit theorem with explicit asymptotic covariances in the small-ambiguity regime.
A preprint posted to arXiv this week tackles a longstanding computational bottleneck in distributionally robust reinforcement learning (DRRL): the robust Bellman operator is nonlinear in the transition kernel, making standard one-sample updates biased and forcing an expensive adversarial optimization at every evaluation step. The researchers propose an approximate DRRL framework that sidesteps both problems by working in the small-ambiguity regime with Kullback–Leibler (KL) ambiguity sets.
The key insight is a first-order expansion of the robust functional around the nominal transition kernel. This yields an approximate robust Bellman equation that removes the adversarial optimization entirely while remaining first-order accurate in the ambiguity radius. The result is a tractable fixed-point equation that can be learned with model-free, one-sample updates — a property the standard robust Bellman operator does not admit.
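To make the expansion concrete, the following is a sketch of the standard small-radius result for KL ambiguity sets, not a quotation of the paper's operator; the exact constants and error terms in the preprint may differ:

```latex
% Generic first-order expansion of a KL-robust expectation in the radius delta
% (a standard result; the paper's exact statement and constants may differ).
\[
  \inf_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \delta} \mathbb{E}_{Q}[Z]
  \;=\; \mathbb{E}_{P}[Z] \;-\; \sqrt{2\delta \,\mathrm{Var}_{P}(Z)} \;+\; o(\sqrt{\delta}).
\]
```

Taking Z to be the one-step Bellman target r(s,a) + γV(s') under the nominal kernel P turns the robust backup into a mean backup minus a standard-deviation penalty, which is exactly the kind of fixed point a mean-variance method can estimate from single samples.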
Algorithm and convergence
To learn the fixed point, the paper introduces Mean-Variance Stochastic Approximation (MVSA), a two-time-scale algorithm built on lifted stochastic-approximation dynamics. The main iterate tracks the value function, while a faster auxiliary iterate estimates the variance of the Bellman residual. The two-time-scale design lets MVSA update from a single sampled transition at a time, avoiding the sample-complexity blow-up that plagues exact robust methods.
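As a rough illustration of this two-time-scale structure, here is a minimal Python sketch of a mean-variance update for tabular policy evaluation; the step-size exponents, the sqrt(2·δ·variance) penalty, and every identifier are assumptions made for illustration, not the authors' MVSA specification.

```python
import numpy as np

def mvsa_sketch(sample_transition, n_states, delta, gamma,
                n_iters=100_000, seed=0):
    """Illustrative two-time-scale mean-variance update for tabular policy
    evaluation. NOT the paper's algorithm: step sizes, the variance penalty,
    and initialization are assumptions.

    sample_transition(s) -> (reward, next_state), one sample per update,
    drawn from the nominal kernel under the evaluated policy.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)   # slow iterate: value estimate
    W = np.zeros(n_states)   # fast iterate: mean squared Bellman residual

    for t in range(1, n_iters + 1):
        alpha = 1.0 / t**0.9   # slow step size
        beta = 1.0 / t**0.6    # fast step size (larger than alpha for large t)

        s = int(rng.integers(n_states))       # state to update
        r, s_next = sample_transition(s)      # one sampled transition
        target = r + gamma * V[s_next]        # Bellman target under nominal kernel

        # Fast time scale: track the mean squared Bellman residual at s
        # (approximately its variance near the fixed point).
        W[s] += beta * ((target - V[s]) ** 2 - W[s])

        # Slow time scale: mean backup minus a small-ambiguity variance penalty.
        robust_target = target - np.sqrt(2.0 * delta * max(W[s], 0.0))
        V[s] += alpha * (robust_target - V[s])

    return V

if __name__ == "__main__":
    # Toy two-state chain: from either state, go to state 0 w.p. 0.7; reward = state index.
    def sample_transition(s, rng=np.random.default_rng(1)):
        s_next = 0 if rng.random() < 0.7 else 1
        return float(s), s_next

    print(mvsa_sketch(sample_transition, n_states=2, delta=0.05, gamma=0.9))
```

The only data each update touches is the single sampled transition (r, s_next), which is the model-free, one-sample property the paper emphasizes; the auxiliary iterate W supplies the variance estimate that the penalty term needs.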
The authors prove MVSA converges and satisfies a central limit theorem at the canonical n^−1/2 rate, with asymptotic covariances characterized explicitly. A numerical experiment validates the theoretical findings.
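In generic stochastic-approximation notation, a result of this type takes the following shape, where V_n denotes the iterate after n updates, V*_δ the fixed point of the approximate robust Bellman equation, and Σ stands in for the covariance the paper characterizes; the precise normalization and covariance formula are in the preprint.

```latex
% Generic shape of a CLT for a stochastic-approximation iterate; Sigma is the
% asymptotic covariance characterized explicitly in the paper.
\[
  \sqrt{n}\,\bigl(V_n - V^{\ast}_{\delta}\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}(0,\Sigma).
\]
```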
