Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories in order to understand the task for imitation or collaboration, thereby removing the need for manual reward engineering. However, IRL in large, high-dimensional problems with unknown dynamics has been particularly challenging. In this paper, we present a new Variational Lower Bound for IRL (VLB-IRL), derived under the framework of a probabilistic graphical model with an optimality node. Our method simultaneously learns the reward function and the policy under the learned reward function by maximizing the lower bound, which is equivalent to minimizing the reverse Kullback-Leibler divergence between an approximated distribution of optimality given the reward function and the true distribution of optimality given trajectories. This leads to a new IRL method that learns a valid reward function such that the policy under the learned reward achieves expert-level performance on several known domains. Importantly, the method outperforms existing state-of-the-art IRL algorithms on these domains, with the learned policy obtaining higher reward.
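To make the direction of the divergence concrete, the following is a minimal sketch, not the VLB-IRL objective itself. It assumes the optimality variable is binary, so that both distributions reduce to Bernoulli parameters. The names `q_approx` (the approximated probability of optimality given the reward) and `p_true` (the true probability of optimality given the trajectory) are hypothetical and chosen only for illustration.

```python
import math


def bernoulli_kl(a: float, b: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(a) || Bernoulli(b)) in nats.

    Both parameters are clipped to (eps, 1 - eps) to avoid log(0).
    """
    a = min(max(a, eps), 1.0 - eps)
    b = min(max(b, eps), 1.0 - eps)
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))


def reverse_kl(q_approx: float, p_true: float) -> float:
    """Reverse KL: the expectation is taken under the approximation q.

    Assumption for illustration: optimality is a binary variable, so each
    distribution is summarized by a single probability.
    """
    return bernoulli_kl(q_approx, p_true)


def forward_kl(q_approx: float, p_true: float) -> float:
    """Forward KL, shown only for contrast: expectation under the true p."""
    return bernoulli_kl(p_true, q_approx)
```

In this toy setting the reverse divergence is zero exactly when the two probabilities agree and positive otherwise, and it generally differs from the forward divergence, which is why the direction is stated explicitly.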