We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.
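For reference, the two standard objects named in the abstract can be written out as follows. This is textbook background only, not the robust algorithm or the refined inequality developed in the paper; the notation (step size \(\alpha\), discount factor \(\gamma\), increment bounds \(c_k\)) is assumed here for illustration.

In the asynchronous model, the learner observes a single trajectory \((s_t, a_t, r_t, s_{t+1})\) and updates only the visited state-action pair:

\[
Q_{t+1}(s_t, a_t) = (1-\alpha)\, Q_t(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \Big),
\qquad
Q_{t+1}(s, a) = Q_t(s, a) \ \text{ for } (s,a) \neq (s_t, a_t).
\]

The classical Azuma-Hoeffding inequality states that for a martingale \((X_k)_{k \ge 0}\) with \(|X_k - X_{k-1}| \le c_k\) almost surely,

\[
\Pr\big( |X_n - X_0| \ge \epsilon \big) \le 2 \exp\!\left( - \frac{\epsilon^2}{2 \sum_{k=1}^{n} c_k^2} \right).
\]

In the corrupted setting described above, some of the observed rewards \(r_t\) are replaced by adversarial values, so the plain update would be driven by those values directly; the abstract's robust variant and its almost-martingale analysis address that case.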