This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL), where the transition dynamics can be corrupted by an adversary. Existing studies on corruption-robust RL mostly focus on the model-free setting, where robust least-squares regression is often employed for value function estimation. However, these techniques cannot be directly applied to model-based RL. In this paper, we focus on model-based RL and take the maximum likelihood estimation (MLE) approach to learn the transition model. Our work encompasses both online and offline settings. In the online setting, we introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE. We prove that CR-OMLE achieves a regret of $\tilde{\mathcal{O}}(\sqrt{T} + C)$, where $C$ denotes the cumulative corruption level after $T$ episodes. We also prove a lower bound showing that the additive dependence on $C$ is optimal. We extend our weighting technique to the offline setting and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE). Under a uniform coverage condition, the suboptimality of CR-PMLE worsens by $\mathcal{O}(C/n)$, nearly matching the lower bound. To the best of our knowledge, this is the first work on corruption-robust model-based RL algorithms with provable guarantees.
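To make the idea of uncertainty-weighted MLE concrete, the following is a minimal sketch for a tabular transition model. It is not the CR-OMLE or CR-PMLE algorithm: the function names, the weighting rule `min(1, alpha / u)`, the threshold `alpha`, and the uncertainty scores are all assumptions chosen for illustration, and the TV-based information ratio is not computed here. The sketch only relies on the fact that, for a categorical model, the maximizer of the weighted log-likelihood is the normalized weighted count.

```python
from collections import defaultdict


def uncertainty_weights(uncertainties, alpha):
    """Hypothetical weighting rule: keep weight 1 for low-uncertainty samples
    and shrink the weight to alpha / u once the uncertainty u exceeds alpha."""
    return [1.0 if u <= alpha else alpha / u for u in uncertainties]


def weighted_mle_transitions(samples, weights):
    """Weighted MLE of a tabular transition model.

    samples: list of (state, action, next_state) tuples
    weights: list of non-negative floats, one per sample

    For a categorical model, maximizing sum_i w_i * log P(s'_i | s_i, a_i)
    gives the normalized weighted counts computed below.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for (s, a, s_next), w in zip(samples, weights):
        counts[(s, a)][s_next] += w

    model = {}
    for sa, next_counts in counts.items():
        total = sum(next_counts.values())
        if total > 0:
            model[sa] = {s_next: c / total for s_next, c in next_counts.items()}
    return model


# Toy data: 8 clean transitions to state 1 and 2 corrupted transitions to state 2.
samples = [(0, 0, 1)] * 8 + [(0, 0, 2)] * 2
# Made-up uncertainty scores: the corrupted samples are assumed to be the uncertain ones.
uncertainties = [0.5] * 8 + [4.0] * 2

unweighted = weighted_mle_transitions(samples, [1.0] * len(samples))
weights = uncertainty_weights(uncertainties, alpha=1.0)
weighted = weighted_mle_transitions(samples, weights)

print("unweighted P(1 | 0, 0):", unweighted[(0, 0)][1])
print("weighted   P(1 | 0, 0):", weighted[(0, 0)][1])
```

In this toy example the down-weighted estimate puts more mass on the clean next state than the unweighted one, which is the qualitative effect the weighting is meant to have. Whether high uncertainty coincides with corruption in practice depends on the setting and is assumed here only for the demonstration.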