We initiate the study of multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of stochastic bandits. We provide a framework that modifies the aggressive exploration of existing reinforcement learning approaches based on "optimism in the face of uncertainty" by complementing them with principles from "action elimination". Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms that (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees that degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) and linear-function-approximation settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.