The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning

The paper concerns the $d$-dimensional stochastic approximation recursion, $$ \theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n, \Phi_{n+1}), $$ in which $\Phi$ is a geometrically ergodic Markov chain on a general state space $\textsf{X}$ with stationary distribution $\pi$, and $f:\Re^d\times\textsf{X}\to\Re^d$. The main results are established under a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3), and a stability condition for the mean flow with vector field $\bar{f}(\theta)=\textsf{E}[f(\theta,\Phi)]$, with $\Phi\sim\pi$. (i) The sequence $\{\theta_n\}$ converges a.s. and in $L_4$ to the unique root $\theta^*$ of $\bar{f}$. (ii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error. (iii) The CLT holds for the normalized version, $z_n := \sqrt{n} (\theta^{\text{PR}}_n - \theta^*)$, of the averaged parameters, $\theta^{\text{PR}}_n := n^{-1} \sum_{k=1}^n \theta_k$, subject to standard assumptions on the step-size. Moreover, the normalized covariance converges, $$ \lim_{n \to \infty} n \, \textsf{E} [ \widetilde{\theta}^{\text{PR}}_n (\widetilde{\theta}^{\text{PR}}_n)^T ] = \Sigma_\theta^*, \quad \text{with } \widetilde{\theta}^{\text{PR}}_n = \theta^{\text{PR}}_n - \theta^*, $$ where $\Sigma_\theta^*$ is the minimal covariance of Polyak and Ruppert. (iv) An example is given in which $f$ and $\bar{f}$ are linear in $\theta$, and the Markov chain $\Phi$ is geometrically ergodic but does not satisfy (DV3). While the algorithm is convergent, the second moment is unbounded: $\textsf{E} [ \| \theta_n \|^2 ] \to \infty$ as $n\to\infty$.
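To make the recursion and the averaged estimate in (iii) concrete, the following is a minimal sketch and not the paper's experiment. It assumes a hypothetical scalar model ($d=1$) driven by a two-state Markov chain, with $f(\theta, x) = c(x) - \theta$, so that $\bar{f}(\theta) = \sum_x \pi(x) c(x) - \theta$ and $\theta^* = \sum_x \pi(x) c(x)$. The transition probabilities, the values $c(x)$, and the step-size $\alpha_n = n^{-\rho}$ with $\rho = 0.7$ are all illustrative assumptions.

```python
import random


def simulate_sa(n_steps=200_000, rho=0.7, seed=0):
    """Run theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, Phi_{n+1})
    for a hypothetical two-state chain, with Polyak-Ruppert averaging."""
    rng = random.Random(seed)

    # Assumed model (not from the paper): chain on {0, 1}.
    p_stay = (0.9, 0.6)   # probability of remaining in state 0 / state 1
    c = (1.0, -2.0)       # f(theta, x) = c[x] - theta

    # Stationary distribution of the two-state chain, and the root of f_bar.
    leave0, leave1 = 1.0 - p_stay[0], 1.0 - p_stay[1]
    pi0 = leave1 / (leave0 + leave1)
    theta_star = pi0 * c[0] + (1.0 - pi0) * c[1]

    theta = 0.0    # theta_0
    pr_avg = 0.0   # running average n^{-1} * sum_{k=1}^n theta_k
    x = 0          # Phi_0

    for n in range(1, n_steps + 1):
        # Sample the next state of the Markov chain.
        if rng.random() >= p_stay[x]:
            x = 1 - x
        alpha = n ** (-rho)                # step-size alpha_n = n^{-rho}
        theta += alpha * (c[x] - theta)    # stochastic approximation update
        pr_avg += (theta - pr_avg) / n     # incremental Polyak-Ruppert average

    return theta, pr_avg, theta_star


if __name__ == "__main__":
    theta, pr_avg, theta_star = simulate_sa()
    print(f"theta*       = {theta_star:.4f}")
    print(f"last iterate = {theta:.4f}")
    print(f"PR average   = {pr_avg:.4f}")
```

Under these assumptions, both the last iterate and the averaged estimate should settle near $\theta^*$, with the averaged estimate typically fluctuating less. Repeating the run over many seeds and scaling the averaged error by $\sqrt{n}$ gives an empirical view of the normalized quantity $z_n$.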
