Inverse reinforcement learning (IRL) aims to infer a reward from expert demonstrations, motivated by the idea that the reward, rather than the policy, is the most succinct and transferable description of a task [Ng et al., 2000]. However, the reward corresponding to an optimal policy is not unique, so it is unclear whether an IRL-learned reward transfers to new transition laws, in the sense that its optimal policy matches the optimal policy under the expert's true reward. Past work has addressed this problem only under the assumption of full access to the expert's policy, guaranteeing transferability when learning from two experts who share a reward but whose transition laws differ and satisfy a specific rank condition [Rolland et al., 2022]. In this work, we show that the conditions developed under full access to the expert's policy cannot guarantee transferability in the more practical setting where only demonstrations of the expert are available. In place of a binary rank condition, we propose principal angles as a more refined measure of the similarity between transition laws. Based on this, we establish two key results: 1) a sufficient condition for transferability to any transition law when learning from at least two experts with sufficiently different transition laws, and 2) a sufficient condition for transferability to local changes in the transition law when learning from a single expert. Furthermore, we provide a probably approximately correct (PAC) algorithm and an end-to-end analysis for learning transferable rewards from demonstrations of multiple experts.
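The principal angles mentioned above quantify how similar two subspaces are: all angles are zero for identical subspaces and π/2 for orthogonal directions. As a minimal illustration (not the paper's algorithm; the subspaces here are arbitrary example matrices), they can be computed from orthonormal bases via an SVD, since the singular values of the product of the bases are the cosines of the angles:

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Identical subspaces: every principal angle is 0 ("maximally similar").
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(principal_angles(A, A))  # -> [0. 0.]

# Orthogonal one-dimensional subspaces: the angle is pi/2 ("maximally different").
u = np.array([[1.0], [0.0], [0.0]])
v = np.array([[0.0], [1.0], [0.0]])
print(principal_angles(u, v))  # -> [1.5708...]
```

A binary rank condition only distinguishes "some angle is zero" from "no angle is zero"; the angles themselves grade how far apart two transition laws are, which is what makes the finer transferability conditions possible.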