The Stochastic Approximation (SA) algorithm, introduced by Robbins and Monro in 1951, has been a standard method for solving equations of the form $\mathbf{f}({\boldsymbol {\theta}}) = \mathbf{0}$ when only noisy measurements of $\mathbf{f}(\cdot)$ are available. If $\mathbf{f}({\boldsymbol {\theta}}) = \nabla J({\boldsymbol {\theta}})$ for some function $J(\cdot)$, then SA can also be used to find a stationary point of $J(\cdot)$. In much of the literature, it is assumed that the measurement error term ${\boldsymbol {\xi}}_{t+1}$ has zero conditional mean, and that its conditional variance is bounded as a function of $t$ (though not necessarily with respect to ${\boldsymbol {\theta}}_t$). Also, for the most part, the emphasis has been on ``synchronous'' SA, whereby, at each time $t$, \textit{every} component of ${\boldsymbol {\theta}}_t$ is updated. Over the years, SA has been applied to a variety of areas, two of which are the focus of this paper: convex and nonconvex optimization, and Reinforcement Learning (RL). As it turns out, in these applications, the above-mentioned assumptions do not always hold. For example, in zero-order methods, the error has neither zero conditional mean nor bounded conditional variance. In the present paper, we extend SA theory to encompass errors with nonzero conditional mean and/or unbounded conditional variance, as well as asynchronous SA. In addition, we derive estimates for the rate of convergence of the algorithm. We then apply the new results to problems in nonconvex optimization, and to Markovian SA, a recently emerging area in RL. We prove that SA converges in these situations, and compute the ``optimal step size sequences'' that maximize the estimated rate of convergence.
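To make the classical setting concrete, the following is a minimal sketch of synchronous SA under the standard assumptions described above (zero-mean noise with bounded variance). It is an illustration only, not the algorithm analyzed in the paper: the update convention ${\boldsymbol {\theta}}_{t+1} = {\boldsymbol {\theta}}_t + \alpha_t (\mathbf{f}({\boldsymbol {\theta}}_t) + {\boldsymbol {\xi}}_{t+1})$, the toy function $\mathbf{f}({\boldsymbol {\theta}}) = \mathbf{c} - {\boldsymbol {\theta}}$, the step sizes $\alpha_t = 1/(t+1)$, and the names `robbins_monro` and `make_noisy_f` are all assumptions chosen for the example.

```python
import random


def robbins_monro(noisy_f, theta0, num_steps, step_size):
    """Synchronous SA: every component of theta is updated at each step.

    Assumed update rule (illustrative convention, not taken from the paper):
        theta_{t+1} = theta_t + alpha_t * (f(theta_t) + xi_{t+1})
    where noisy_f(theta) returns the noisy measurement f(theta) + xi.
    """
    theta = list(theta0)
    for t in range(num_steps):
        y = noisy_f(theta)      # noisy measurement of f at the current iterate
        alpha = step_size(t)    # step size alpha_t
        theta = [th + alpha * yi for th, yi in zip(theta, y)]
    return theta


def make_noisy_f(target, noise_std, rng):
    """Hypothetical test problem: f(theta) = target - theta, root at target.

    The added noise is Gaussian with zero mean and fixed variance, i.e. it
    satisfies the classical assumptions (zero conditional mean, bounded
    conditional variance).
    """
    def noisy_f(theta):
        return [(c - th) + rng.gauss(0.0, noise_std)
                for th, c in zip(theta, target)]
    return noisy_f


rng = random.Random(0)              # fixed seed for reproducibility
target = [1.0, -2.0]                # the root of f in this toy problem
theta_hat = robbins_monro(
    make_noisy_f(target, noise_std=1.0, rng=rng),
    theta0=[0.0, 0.0],
    num_steps=20000,
    step_size=lambda t: 1.0 / (t + 1),  # sum diverges, sum of squares converges
)
print(theta_hat)
```

With this particular choice of $\mathbf{f}$ and $\alpha_t$, the iterate after $T$ steps equals $\mathbf{c}$ plus the average of the first $T$ noise samples, so the error shrinks on the order of $1/\sqrt{T}$. The settings studied in the paper (biased errors, unbounded conditional variance, asynchronous updates) relax exactly the assumptions that this toy example relies on.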