New two-time-scale algorithm brings a central limit theorem to robust reinforcement learning
New preprint proposes Mean-Variance Stochastic Approximation for distributionally robust reinforcement learning, proving convergence and a central limit theorem with explicit asymptotic covariances in the small-ambiguity regime.
A preprint posted to arXiv this week tackles a longstanding computational bottleneck in distributionally robust reinforcement learning (DRRL): the robust Bellman operator is nonlinear in the transition kernel, making standard one-sample updates biased and forcing an expensive adversarial optimization at every evaluation step. The researchers propose an approximate DRRL framework that sidesteps both problems by working in the small-ambiguity regime with Kullback–Leibler (KL) ambiguity sets.
The key insight is a first-order expansion of the robust functional around the nominal transition kernel. This yields an approximate robust Bellman equation that removes the adversarial optimization entirely while remaining first-order accurate in the ambiguity radius. The result is a tractable fixed-point equation that can be learned with model-free, one-sample updates — a property the standard robust Bellman operator does not admit.
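To make the expansion concrete, the following is a sketch of the standard small-radius result for KL ambiguity sets, not a quotation of the paper's operator; the exact constants and error terms in the preprint may differ:

```latex
% Generic first-order expansion of a KL-robust expectation in the radius delta
% (a standard result; the paper's exact statement and constants may differ).
\[
  \inf_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \delta} \mathbb{E}_{Q}[Z]
  \;=\; \mathbb{E}_{P}[Z] \;-\; \sqrt{2\delta \,\mathrm{Var}_{P}(Z)} \;+\; o(\sqrt{\delta}).
\]
```

Taking Z to be the one-step Bellman target r(s,a) + γV(s') under the nominal kernel P turns the robust backup into a mean backup minus a standard-deviation penalty, which is exactly the kind of fixed point a mean-variance method can estimate from single samples.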
Algorithm and convergence
To learn the fixed point, the paper introduces Mean-Variance Stochastic Approximation (MVSA), a two-time-scale algorithm built on lifted stochastic-approximation dynamics. The main iterate tracks the value function, while a faster auxiliary iterate estimates the variance of the Bellman residual. The two-time-scale design lets MVSA update from a single sampled transition at a time, avoiding the sample-complexity blow-up that plagues exact robust methods.
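As a rough illustration of this two-time-scale structure, here is a minimal Python sketch of a mean-variance update for tabular policy evaluation; the step-size exponents, the sqrt(2·δ·variance) penalty, and every identifier are assumptions made for illustration, not the authors' MVSA specification.

```python
import numpy as np

def mvsa_sketch(sample_transition, n_states, delta, gamma,
                n_iters=100_000, seed=0):
    """Illustrative two-time-scale mean-variance update for tabular policy
    evaluation. NOT the paper's algorithm: step sizes, the variance penalty,
    and initialization are assumptions.

    sample_transition(s) -> (reward, next_state), one sample per update,
    drawn from the nominal kernel under the evaluated policy.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)   # slow iterate: value estimate
    W = np.zeros(n_states)   # fast iterate: mean squared Bellman residual

    for t in range(1, n_iters + 1):
        alpha = 1.0 / t**0.9   # slow step size
        beta = 1.0 / t**0.6    # fast step size (larger than alpha for large t)

        s = int(rng.integers(n_states))       # state to update
        r, s_next = sample_transition(s)      # one sampled transition
        target = r + gamma * V[s_next]        # Bellman target under nominal kernel

        # Fast time scale: track the mean squared Bellman residual at s
        # (approximately its variance near the fixed point).
        W[s] += beta * ((target - V[s]) ** 2 - W[s])

        # Slow time scale: mean backup minus a small-ambiguity variance penalty.
        robust_target = target - np.sqrt(2.0 * delta * max(W[s], 0.0))
        V[s] += alpha * (robust_target - V[s])

    return V

if __name__ == "__main__":
    # Toy two-state chain: from either state, go to state 0 w.p. 0.7; reward = state index.
    def sample_transition(s, rng=np.random.default_rng(1)):
        s_next = 0 if rng.random() < 0.7 else 1
        return float(s), s_next

    print(mvsa_sketch(sample_transition, n_states=2, delta=0.05, gamma=0.9))
```

The only data each update touches is the single sampled transition (r, s_next), which is the model-free, one-sample property the paper emphasizes; the auxiliary iterate W supplies the variance estimate that the penalty term needs.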
The authors prove MVSA converges and satisfies a central limit theorem at the canonical n^−1/2 rate, with asymptotic covariances characterized explicitly. A numerical experiment validates the theoretical findings.
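In generic stochastic-approximation notation, a result of this type takes the following shape, where V_n denotes the iterate after n updates, V*_δ the fixed point of the approximate robust Bellman equation, and Σ stands in for the covariance the paper characterizes; the precise normalization and covariance formula are in the preprint.

```latex
% Generic shape of a CLT for a stochastic-approximation iterate; Sigma is the
% asymptotic covariance characterized explicitly in the paper.
\[
  \sqrt{n}\,\bigl(V_n - V^{\ast}_{\delta}\bigr) \;\xrightarrow{\;d\;}\; \mathcal{N}(0,\Sigma).
\]
```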
