The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. These models are therefore misspecified, which raises the worry that they might lead to unsound inferences when applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and characterise precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.