Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning

The stochastic approximation (SA) algorithm is a widely used probabilistic method for finding a zero or a fixed point of a vector-valued function, when only noisy measurements of the function are available. In the literature to date, one makes a distinction between ``synchronous'' updating, whereby every component of the current guess is updated at each time, and ``asynchronous'' updating, whereby only one component is updated. In this paper, we study an intermediate situation that we call ``batch asynchronous stochastic approximation'' (BASA), in which, at each time instant, \textit{some but not all} components of the current estimated solution are updated. BASA allows the user to trade off memory requirements against time complexity. We develop a general methodology for proving that such algorithms converge to the fixed point of the map under study. These convergence proofs make use of weaker hypotheses than existing results. Specifically, existing convergence proofs require the measurement noise to be a zero-mean i.i.d.\ sequence or a martingale difference sequence. In the present paper, we permit biased measurements, that is, measurement noises that have nonzero conditional mean. Also, all convergence results to date assume that the stochastic step sizes satisfy a probabilistic analog of the well-known Robbins-Monro conditions. We replace this assumption by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we analyze the temporal difference algorithm $TD(\lambda)$ for value iteration, and the $Q$-learning algorithm for finding the optimal action-value function. In both cases, we establish the convergence of these algorithms, under milder conditions than in the existing literature.
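To make the batch asynchronous idea concrete, the following is a minimal Python sketch, not the construction analyzed in the paper. It assumes a hypothetical map `f` that is a contraction in the sup norm with known fixed point `x_star`, zero-mean Gaussian measurement noise, a batch drawn uniformly at random at each time instant, and ``local clock'' step sizes equal to the reciprocal of the number of times a component has been updated. The names `basa_fixed_point`, `f`, `x_star`, and `gamma` are illustrative assumptions; the paper itself permits more general noise (including biased measurements) and step-size choices.

```python
import random


def basa_fixed_point(f, d, batch_size, steps, noise_std=0.1, seed=0):
    """Sketch of batch asynchronous SA: update only a subset of components per step."""
    rng = random.Random(seed)
    theta = [0.0] * d
    counts = [0] * d  # local clocks: how often each component has been updated
    for _ in range(steps):
        # Choose "some but not all" components (assumption: uniform random batch).
        batch = rng.sample(range(d), batch_size)
        target = f(theta)  # evaluated at the current estimate, before any update
        for i in batch:
            counts[i] += 1
            alpha = 1.0 / counts[i]  # assumed step size based on the local clock
            noisy = target[i] + rng.gauss(0.0, noise_std)  # noisy measurement
            theta[i] += alpha * (noisy - theta[i])
    return theta


# Hypothetical test map: a sup-norm contraction with factor gamma whose
# fixed point is x_star by construction.
x_star = [1.0, 2.0, 3.0, 4.0]
gamma = 0.3


def f(x):
    d = len(x)
    return [x_star[i] + gamma * (x[(i + 1) % d] - x_star[(i + 1) % d])
            for i in range(d)]


theta = basa_fixed_point(f, d=4, batch_size=2, steps=20000)
print(theta)
```

Setting `batch_size=1` in this sketch gives the asynchronous case and `batch_size=d` the synchronous case, which is the trade-off between per-step work and number of steps that the abstract refers to.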
