Temporal-difference (TD) learning is one of the most widely used algorithms in reinforcement learning (RL). Despite its widespread use, only recently have researchers begun to actively study its finite-time behavior, including finite-time bounds on the mean squared error and the sample complexity. On the empirical side, experience replay has been a key ingredient in the success of deep RL algorithms, but its theoretical effects on RL are not yet fully understood. In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for TD learning with experience replay. Specifically, under the Markovian observation model, we show that for both the averaged-iterate and final-iterate cases, the error term induced by a constant step size can be effectively controlled by the size of the replay buffer and the size of the mini-batch sampled from it.
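To make the setting concrete, the following is a minimal sketch of TD(0) with linear function approximation, a fixed-size replay buffer, and mini-batch updates under a constant step size. It is an illustration under stated assumptions, not the exact algorithm analyzed in the paper: the function name `td_experience_replay`, the circular-buffer layout, uniform sampling with replacement, and the small three-state Markov reward process are all hypothetical choices made for this example.

```python
import numpy as np


def td_experience_replay(P, r, Phi, gamma=0.5, alpha=0.1,
                         buffer_size=200, batch_size=32,
                         num_steps=5000, seed=0):
    """TD(0) with a replay buffer on a Markov reward process (illustrative sketch).

    P: (n, n) transition matrix, r: (n,) reward per state, Phi: (n, d) features.
    Returns the final iterate and the average of all iterates.
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    theta = np.zeros(Phi.shape[1])
    theta_sum = np.zeros_like(theta)

    # Circular replay buffer storing transitions (s, r, s').
    buf_s = np.zeros(buffer_size, dtype=int)
    buf_r = np.zeros(buffer_size)
    buf_next = np.zeros(buffer_size, dtype=int)
    filled = 0

    s = 0  # Markovian observation model: a single trajectory, no resets.
    for t in range(num_steps):
        s_next = rng.choice(n_states, p=P[s])
        slot = t % buffer_size  # overwrite the oldest transition
        buf_s[slot], buf_r[slot], buf_next[slot] = s, r[s], s_next
        filled = min(filled + 1, buffer_size)
        s = s_next

        # Assumption: the mini-batch is drawn uniformly with replacement.
        idx = rng.integers(filled, size=batch_size)
        phi = Phi[buf_s[idx]]            # (batch, d)
        phi_next = Phi[buf_next[idx]]    # (batch, d)
        delta = buf_r[idx] + gamma * phi_next @ theta - phi @ theta  # TD errors

        # Constant step-size update with the mini-batch-averaged TD direction.
        theta = theta + alpha * (phi.T @ delta) / batch_size
        theta_sum += theta

    return theta, theta_sum / num_steps


# Hypothetical three-state chain with tabular (one-hot) features.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
r = np.array([0.0, 1.0, 2.0])
Phi = np.eye(3)
gamma = 0.5

# Exact value function for comparison: V = (I - gamma P)^{-1} r.
V_true = np.linalg.solve(np.eye(3) - gamma * P, r)
theta_final, theta_avg = td_experience_replay(P, r, Phi, gamma=gamma)
```

With one-hot features the learned parameters can be compared directly with `V_true`. Varying `buffer_size` and `batch_size` in this sketch is one informal way to explore how the constant-step-size error depends on them, which is the dependence the bounds in the paper quantify.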