Two-time-scale stochastic approximation (SA) is an algorithm with coupled iterations that has found broad applications in reinforcement learning, optimization, and game control. In this work, we derive mean squared error bounds for nonlinear two-time-scale iterations with contractive mappings. In the setting where both stepsizes are of order $\Theta(1/k)$, commonly referred to as single-time-scale SA with multiple coupled sequences, we obtain the first $O(1/k)$ rate without imposing additional smoothness assumptions. In the setting with true time-scale separation, the previous best bound was $O(1/k^{2/3})$; we improve it to $O(1/k^a)$ for any $a<1$, approaching the optimal $O(1/k)$ rate. The key step in our analysis is rewriting the original iteration in terms of an averaged noise sequence whose variance decays sufficiently fast. Additionally, we use an induction-based argument to show that the iterates are bounded in expectation. Our results apply to Polyak averaging as well as to algorithms from reinforcement learning and optimization, including gradient descent-ascent and two-time-scale Lagrangian optimization.
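To make the setting concrete, the following is a minimal sketch of a two-time-scale stochastic gradient descent-ascent iteration on a toy saddle-point problem. The objective $f(x,y) = \tfrac{1}{2}x^2 + xy - \tfrac{1}{2}y^2$, the noise model, and all stepsize constants are illustrative assumptions, not taken from the paper; the sketch only shows the structure the abstract refers to: a slow iterate with a $\Theta(1/k)$ stepsize coupled to a fast iterate with a more slowly decaying stepsize.

```python
import random


def two_time_scale_gda(num_iters=50_000, seed=0):
    """Two-time-scale stochastic gradient descent-ascent on the toy
    saddle-point problem f(x, y) = 0.5*x**2 + x*y - 0.5*y**2, whose
    unique saddle point is (0, 0). Illustrative only: the problem,
    noise level, and stepsize constants are assumptions.
    """
    rng = random.Random(seed)
    x, y = 1.0, 1.0  # slow iterate x, fast iterate y
    for k in range(num_iters):
        alpha = 1.0 / (k + 2)           # slow stepsize, Theta(1/k)
        beta = 1.0 / (k + 2) ** 0.67    # fast stepsize decays more slowly
        # Noisy gradient oracles: grad_x f = x + y, grad_y f = x - y.
        gx = x + y + rng.gauss(0.0, 0.05)
        gy = x - y + rng.gauss(0.0, 0.05)
        x -= alpha * gx  # descent step on the slow variable
        y += beta * gy   # ascent step on the fast variable
    return x, y
```

Because the fast iterate sees the larger stepsize, it tracks its equilibrium $y = x$ given the current slow iterate, while the slow iterate drives both toward the saddle point; the decaying stepsizes average out the gradient noise.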