Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories, to understand the task for imitation or collaboration thereby removing the need for manual reward engineering. However, IRL in the context of large, high-dimensional problems with unknown dynamics has been particularly challenging. In this paper, we present a new Variational Lower Bound for IRL (VLB-IRL), which is derived under the framework of a probabilistic graphical model with an optimality node. Our method simultaneously learns the reward function and policy under the learned reward function by maximizing the lower bound, which is equivalent to minimizing the reverse Kullback-Leibler divergence between an approximated distribution of optimality given the reward function and the true distribution of optimality given trajectories. This leads to a new IRL method that learns a valid reward function such that the policy under the learned reward achieves expert-level performance on several known domains. Importantly, the method outperforms the existing state-of-the-art IRL algorithms on these domains by demonstrating better reward from the learned policy.
翻译:逆向强化学习(IRL)旨在从专家轨迹中学习奖励函数,从而理解任务以进行模仿或协作,进而避免手动设计奖励。然而,在具有未知动力学的大规模、高维问题中,IRL 尤其具有挑战性。本文提出了一种用于IRL的新颖变分下界(VLB-IRL),该下界是在带有最优性节点的概率图模型框架下推导得出的。我们的方法通过最大化下界同时学习奖励函数以及在该奖励函数下的策略,这等价于最小化在给定奖励函数下的最优性近似分布与在给定轨迹下的最优性真实分布之间的反向Kullback-Leibler散度。由此产生了一种新的IRL方法,该方法能够学习有效的奖励函数,使得在该奖励函数下的策略在多个已知领域上达到专家级性能。重要的是,该方法在这些领域上优于现有最先进的IRL算法,通过展示所学策略带来的更优奖励。