Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small inevitable $O(\varepsilon)$ error term that scales with the adversarial corruption fraction $\varepsilon$. Notably, our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.
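For orientation, a minimal sketch of the update our algorithm builds on is given below; the precise robust estimator, step sizes, and sampling model are left to the body of the paper, and the form shown here only illustrates how a robust reward estimate enters a standard synchronous Q-learning iteration:
$$
Q_{t+1}(s,a) \;=\; (1-\alpha_t)\,Q_t(s,a) \;+\; \alpha_t\Big(\widehat{r}_t(s,a) \;+\; \gamma \max_{a'} Q_t\big(s_t(s,a), a'\big)\Big),
$$
where $s_t(s,a) \sim P(\cdot \mid s,a)$ is the freshly sampled next state for the pair $(s,a)$, $\gamma$ is the discount factor, and $\widehat{r}_t(s,a)$ is a robust estimate of the mean reward constructed from the historical (possibly corrupted) reward samples observed at $(s,a)$ up to time $t$, in place of the raw reward used by vanilla Q-learning.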