How well do reward functions learned with inverse reinforcement learning (IRL) generalize? We show that state-of-the-art IRL algorithms, which optimize a maximum-entropy (MaxEnt) objective, learn rewards that overfit to the demonstrations. Such rewards fail to provide meaningful values for states not covered by the demonstrations, a major drawback when the reward is used to learn policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions that generalize better than those learned by MaxEnt IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around the demonstrations, BC-IRL updates the reward parameters so that a policy trained with the updated reward matches the expert demonstrations more closely. We show that BC-IRL learns rewards that generalize better on a simple illustrative task and on two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.
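The reward update described above can be read as a bi-level loop: an inner step improves the policy under the current reward, and an outer step adjusts the reward so that the improved policy sits closer to the expert. The sketch below is a toy illustration of that idea only, not the method as implemented in the paper. It assumes a one-dimensional action, a hypothetical quadratic reward `r_psi(a) = -(a - psi)**2`, a deterministic policy with a single parameter `theta`, and a single gradient-ascent step as the policy update. All names (`inner_policy_step`, `bc_loss`, `train_reward`) and hyperparameters are made up for this example.

```python
import numpy as np


def inner_policy_step(theta, psi, alpha):
    """One gradient-ascent step of the policy on the current reward.

    Assumed (hypothetical) reward: r_psi(a) = -(a - psi)**2,
    so dr/da = -2 * (a - psi).
    """
    return theta + alpha * (-2.0 * (theta - psi))


def bc_loss(theta, expert_actions):
    """Behavioral-cloning loss: mean squared error to the expert actions."""
    return float(np.mean((theta - np.asarray(expert_actions, dtype=float)) ** 2))


def train_reward(expert_actions, psi=0.0, theta=0.0,
                 alpha=0.1, beta=0.5, n_iters=200):
    """Toy bi-level loop: update the reward so the updated policy imitates the expert."""
    expert_actions = np.asarray(expert_actions, dtype=float)
    for _ in range(n_iters):
        # Inner step: what the policy would become under the current reward.
        theta_new = inner_policy_step(theta, psi, alpha)
        # Outer step: gradient of the BC loss w.r.t. the reward parameter,
        # obtained by the chain rule through the inner step
        # (d theta_new / d psi = 2 * alpha for this quadratic reward).
        grad_psi = np.mean(2.0 * (theta_new - expert_actions)) * (2.0 * alpha)
        psi = psi - beta * grad_psi
        # Advance the policy using the updated reward.
        theta = inner_policy_step(theta, psi, alpha)
    return psi, theta


if __name__ == "__main__":
    demos = [1.8, 2.0, 2.2]
    psi, theta = train_reward(demos)
    print("reward peak:", psi, "policy action:", theta)
    print("BC loss:", bc_loss(theta, demos))
```

In this toy setting the outer gradient is available in closed form. With a neural-network reward and policy, one would presumably rely on automatic differentiation through the policy update instead, which is likewise an assumption here rather than something the abstract states.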