The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning

The paper concerns the stochastic approximation recursion, \[ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) \,,\quad n\ge 0, \] in which the {\em estimates} $\theta_n$ evolve in $\Re^d$ and $\{ \Phi_n \}$ is a Markov chain on a general state space. In addition to standard Lipschitz assumptions and conditions on the vanishing step-size sequence $\{\alpha_n\}$, it is assumed that the associated \textit{mean flow} $\tfrac{d}{dt} \vartheta_t = \bar{f}(\vartheta_t)$ is globally asymptotically stable with stationary point denoted $\theta^*$, where $\bar{f}(\theta)=\text{E}[f(\theta,\Phi)]$ and $\Phi$ is distributed according to the stationary distribution of the chain. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3) for the chain: (i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$. (ii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\text{E} [ z_n z_n^T ]$ to the asymptotic covariance $\Sigma^\Theta$ in the CLT, where $z_n= (\theta_n-\theta^*)/\sqrt{\alpha_n}$. (iii) The CLT holds for the normalized version $z^{\text{PR}}_n$ of the averaged parameters $\theta^{\text{PR}}_n$, subject to standard assumptions on the step-size. Moreover, the normalized covariances of $\theta^{\text{PR}}_n$ and $z^{\text{PR}}_n$ both converge to $\Sigma^{\text{PR}}$, the minimal covariance of Polyak and Ruppert. (iv) An example is given where $f$ and $\bar{f}$ are linear in $\theta$, and the Markov chain is geometrically ergodic but does not satisfy (DV3). While the algorithm is convergent, the second moment of $\theta_n$ is unbounded and in fact diverges.
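As a rough illustration of the recursion and of Polyak-Ruppert averaging, the following minimal sketch simulates a scalar instance. Everything specific in it is an assumption made for illustration and is not taken from the paper: the two-state Markov chain on $\{0,1\}$, the linear choice $f(\theta,\Phi)=\Phi-\theta$ (so that $\bar{f}(\theta)=\pi(1)-\theta$ and $\theta^*=\pi(1)$), the step-size $\alpha_n=n^{-\rho}$ with $\rho=0.7$, the burn-in fraction, and the function name `run_sa`.

```python
import random


def run_sa(n_steps=200_000, rho=0.7, p01=0.3, p10=0.2, seed=0):
    """Simulate theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, Phi_{n+1}).

    Illustrative assumptions (not from the paper):
      * Phi is a two-state Markov chain on {0, 1} with
        P(0 -> 1) = p01 and P(1 -> 0) = p10,
      * f(theta, phi) = phi - theta, so theta* = p01 / (p01 + p10),
      * alpha_n = n ** (-rho) with 1/2 < rho < 1.
    Returns (last iterate, Polyak-Ruppert average, theta*).
    """
    rng = random.Random(seed)
    theta = 0.0          # initial estimate theta_0
    phi = 0              # initial state of the chain
    burn_in = n_steps // 5
    pr_sum = 0.0         # running sum of iterates after burn-in
    pr_count = 0

    for n in range(1, n_steps + 1):
        # Advance the Markov chain by one step.
        u = rng.random()
        if phi == 0:
            phi = 1 if u < p01 else 0
        else:
            phi = 0 if u < p10 else 1

        # Stochastic approximation update with vanishing step-size.
        alpha = n ** (-rho)
        theta += alpha * (phi - theta)

        # Accumulate the averaged estimate once the burn-in has passed.
        if n > burn_in:
            pr_sum += theta
            pr_count += 1

    theta_star = p01 / (p01 + p10)
    return theta, pr_sum / pr_count, theta_star


if __name__ == "__main__":
    theta, theta_pr, theta_star = run_sa()
    print(f"theta* = {theta_star:.4f}")
    print(f"last iterate error     = {abs(theta - theta_star):.5f}")
    print(f"averaged iterate error = {abs(theta_pr - theta_star):.5f}")
```

Repeating the run over many seeds and comparing the empirical variance of $(\theta_n-\theta^*)/\sqrt{\alpha_n}$ with that of the suitably scaled averaged error would give a numerical feel for the covariances $\Sigma^\Theta$ and $\Sigma^{\text{PR}}$ in this toy setting; the sketch does not attempt to reproduce the counterexample in (iv).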
