It is often very challenging to manually design reward functions for complex, real-world tasks. To address this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally characterise the partial identifiability of the reward function given several popular reward learning data sources, including expert demonstrations and trajectory comparisons. We also analyse the impact of this partial identifiability on several downstream tasks, such as policy optimisation. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.