In this work, we study the concentration behavior of a stochastic approximation (SA) algorithm driven by an operator that is contractive with respect to an arbitrary norm. We consider two settings in which the iterates are potentially unbounded: (1) bounded multiplicative noise, and (2) additive sub-Gaussian noise. We obtain maximal concentration inequalities on the convergence errors and show that these errors have sub-Gaussian tails in the additive noise setting and super-polynomial tails (decaying faster than any polynomial) in the multiplicative noise setting. In addition, we provide an impossibility result showing that sub-exponential tails are in general not achievable for SA with multiplicative noise. To establish these results, we develop a novel bootstrapping argument that involves bounding the moment generating function of the generalized Moreau envelope of the error and constructing an exponential supermartingale so that Ville's maximal inequality can be applied. To demonstrate the applicability of our theoretical results, we use them to provide maximal concentration bounds for a large class of reinforcement learning algorithms, including but not limited to on-policy TD-learning with linear function approximation, off-policy TD-learning with generalized importance sampling factors, and $Q$-learning. To the best of our knowledge, super-polynomial concentration bounds for off-policy TD-learning have not previously been established in the literature, owing to the challenge of handling the combination of unbounded iterates and multiplicative noise.
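To make the two noise settings concrete, the following is a minimal simulation sketch, not the algorithm or analysis of this work. Everything in it is an assumption chosen for illustration: the scalar contraction $F(x) = \gamma x$ with fixed point $0$, the step size $1/(k+2)$, the noise scale `sigma`, the multiplicative noise model in which a bounded random factor is scaled by $1 + |x_k|$, and the hypothetical helper names `run_sa` and `tail_sup`. The quantity returned by `tail_sup` is the supremum of the error over a tail of the trajectory, which is the kind of quantity a maximal concentration inequality controls.

```python
import random


def run_sa(noise, steps=2000, gamma=0.5, sigma=0.5, x0=5.0, seed=0):
    """Run a scalar SA iteration x_{k+1} = x_k + a_k * (F(x_k) + w_k - x_k).

    F(x) = gamma * x is a contraction with fixed point 0 (illustrative choice).
    Returns the list of errors |x_k - 0| after each update.
    """
    rng = random.Random(seed)
    x = x0
    errors = []
    for k in range(steps):
        step = 1.0 / (k + 2)  # assumed diminishing step-size schedule
        if noise == "additive":
            # Gaussian noise is sub-Gaussian; its size does not depend on x.
            w = sigma * rng.gauss(0.0, 1.0)
        elif noise == "multiplicative":
            # Bounded random factor whose effect grows with the iterate.
            w = sigma * rng.uniform(-1.0, 1.0) * (1.0 + abs(x))
        else:
            raise ValueError("noise must be 'additive' or 'multiplicative'")
        x = x + step * (gamma * x + w - x)
        errors.append(abs(x))
    return errors


def tail_sup(errors, start):
    """Largest error from iteration `start` onward (a maximal error)."""
    return max(errors[start:])


if __name__ == "__main__":
    for setting in ("additive", "multiplicative"):
        sups = [tail_sup(run_sa(setting, seed=s), 1000) for s in range(200)]
        print(setting, "largest tail error over 200 runs:", max(sups))
```

Running the script compares the worst tail error across independent runs in each setting. It only illustrates the setup and is no substitute for the tail bounds themselves.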