Many machine learning and optimization algorithms are built upon the framework of stochastic approximation (SA), for which the selection of the step-size (or learning rate) is essential for success. For the sake of clarity, this paper focuses on the special case $\alpha_n = \alpha_0 n^{-\rho}$ at iteration $n$, where $\rho \in [0,1]$ and $\alpha_0>0$ are design parameters. It is most common in practice to take $\rho=0$ (constant step-size), while in more theoretically oriented papers a vanishing step-size is preferred. In particular, with $\rho \in (1/2, 1)$ it is known that, on applying the averaging technique of Polyak and Ruppert, the mean-squared error (MSE) converges at the optimal rate of $O(1/n)$ and the covariance in the central limit theorem (CLT) is minimal in a precise sense. The paper revisits step-size selection in a general Markovian setting. Under readily verifiable assumptions, the following conclusions are obtained provided $0<\rho<1$:

$\bullet$ Parameter estimates converge with probability one, and also in $L_p$ for any $p\ge 1$.

$\bullet$ The MSE may converge very slowly for small $\rho$: it is of order $O(\alpha_n^2)$ even with averaging.

$\bullet$ For linear stochastic approximation, the source of slow convergence is identified: for any $\rho\in (0,1)$, averaging results in estimates for which the error \textit{covariance} vanishes at the optimal rate, and moreover the CLT covariance is optimal in the sense of Polyak and Ruppert. However, necessary and sufficient conditions are obtained under which the \textit{bias} converges to zero at rate $O(\alpha_n)$.

This is the first paper to obtain such strong conclusions while allowing for $\rho \le 1/2$. A major conclusion is that the choice of $\rho = 0$, or even $\rho < 1/2$, is justified only in select settings: in general, bias may preclude fast convergence.
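As a concrete illustration of the objects discussed above, the following is a minimal numerical sketch of linear SA with step-size $\alpha_n = \alpha_0 n^{-\rho}$ and the Polyak-Ruppert average $\bar\theta_n = n^{-1}\sum_{k=1}^n \theta_k$. The matrix $A$, vector $b$, the i.i.d. Gaussian noise (a simplification of the paper's Markovian setting), and all parameter values are toy assumptions chosen for demonstration, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: linear stochastic approximation
#   theta_{n+1} = theta_n + alpha_{n+1} * (b - A theta_n + W_{n+1}),
# with alpha_n = alpha_0 * n**(-rho) and Polyak-Ruppert averaging.
# The mean flow has fixed point theta* solving A theta* = b.

rng = np.random.default_rng(0)

A = np.array([[2.0, 0.5],
              [0.0, 1.0]])              # eigenvalues 2 and 1, so -A is Hurwitz
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A, b)      # target of the recursion

def run_sa(rho, alpha0=0.5, n_iter=100_000):
    """Run linear SA with alpha_n = alpha0 * n**(-rho); return the PR average."""
    theta = np.zeros(2)
    theta_bar = np.zeros(2)             # running Polyak-Ruppert average
    for n in range(1, n_iter + 1):
        alpha = alpha0 * n ** (-rho)
        W = rng.normal(scale=0.5, size=2)       # additive (i.i.d.) noise
        theta = theta + alpha * (b - A @ theta + W)
        theta_bar += (theta - theta_bar) / n    # incremental mean of theta_1..theta_n
    return theta_bar

for rho in (0.2, 0.6, 0.9):
    err = np.linalg.norm(run_sa(rho) - theta_star)
    print(f"rho={rho:.1f}  |theta_bar - theta*| = {err:.2e}")
```

In this i.i.d.-noise toy model the averaged estimate is unbiased, so all three choices of $\rho$ behave well; the paper's point is that with Markovian noise a \textit{bias} of order $\alpha_n$ can emerge, which dominates for small $\rho$ even though the covariance still vanishes at the optimal rate.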