Inverse reinforcement learning (IRL) aims to infer an agent's preferences (represented as a reward function $R$) from their behaviour (represented as a policy $\pi$). To do this, we need a behavioural model of how $\pi$ relates to $R$. In the current literature, the most common behavioural models are optimality, Boltzmann-rationality, and causal entropy maximisation. However, the true relationship between a human's preferences and their behaviour is much more complex than any of these behavioural models. This means that the behavioural models are misspecified, which raises the concern that they may lead to systematic errors when applied to real data. In this paper, we analyse how sensitive the IRL problem is to misspecification of the behavioural model. Specifically, we provide necessary and sufficient conditions that completely characterise how the observed data may differ from the assumed behavioural model without incurring an error above a given threshold. We also characterise the conditions under which a behavioural model is robust to small perturbations of the observed policy, and we analyse how robust many behavioural models are to misspecification of their parameter values (such as the discount rate). Our analysis suggests that the IRL problem is highly sensitive to misspecification, in the sense that very mild misspecification can lead to very large errors in the inferred reward function.
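For concreteness, one common formalisation of a behavioural model can be sketched as follows. It is stated here as an assumption, since the abstract itself fixes no notation: the temperature $\beta > 0$, the discount rate $\gamma \in (0,1)$, and the optimal $Q$-function $Q^\star_{R,\gamma}$ of the reward $R$ are symbols introduced purely for illustration. Under this reading, a Boltzmann-rational policy is a softmax over optimal $Q$-values:

\[
\pi(a \mid s) \;=\; \frac{\exp\big(\beta\, Q^\star_{R,\gamma}(s,a)\big)}{\sum_{a'} \exp\big(\beta\, Q^\star_{R,\gamma}(s,a')\big)}
\]

In this sketch, $\beta$ and $\gamma$ are examples of the parameter values whose misspecification is analysed: an IRL method that assumes some $\gamma' \neq \gamma$ inverts a different map from $R$ to $\pi$ than the one that generated the data.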