This study addresses adversarial corruption in model-based reinforcement learning (RL), where the transition dynamics can be corrupted by an adversary. Existing work on corruption-robust RL focuses mostly on the model-free setting, where robust least-squares regression is often employed for value-function estimation. These techniques, however, cannot be directly applied to model-based RL. In this paper, we focus on model-based RL and take the maximum likelihood estimation (MLE) approach to learning the transition model. Our work encompasses both the online and offline settings. In the online setting, we introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE. We prove that CR-OMLE achieves a regret of $\tilde{\mathcal{O}}(\sqrt{T} + C)$, where $C$ denotes the cumulative corruption level after $T$ episodes. We also prove a lower bound showing that the additive dependence on $C$ is optimal. We then extend the weighting technique to the offline setting and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE). Under a uniform coverage condition, the suboptimality of CR-PMLE degrades by $\mathcal{O}(C/n)$, where $n$ is the number of offline samples, nearly matching the lower bound. To the best of our knowledge, this is the first work on corruption-robust model-based RL algorithms with provable guarantees.
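To make the weighting scheme concrete, one schematic instance of the weighted MLE objective is sketched below; the symbols $w_t$, $\alpha$, and $\iota_t$ and the trajectory indexing are illustrative placeholders, not the paper's exact definitions:
\[
\hat{P} \in \operatorname*{arg\,max}_{P \in \mathcal{P}} \; \sum_{t=1}^{T} w_t \log P\bigl(s'_t \mid s_t, a_t\bigr),
\qquad
w_t = \min\{1,\, \alpha / \iota_t\},
\]
where $\iota_t$ is a TV-based information ratio measuring how strongly the $t$-th transition can distinguish candidate models. Down-weighting samples with large $\iota_t$, where corruption is most damaging, caps the influence any single corrupted transition can exert on the learned model $\hat{P}$.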