Inverse Reinforcement Learning (IRL) is a powerful framework for learning complex behaviors from expert demonstrations. However, it traditionally requires repeatedly solving a computationally expensive reinforcement learning (RL) problem in its inner loop. It is therefore desirable to reduce the exploration burden of this inner-loop RL problem by leveraging the expert demonstrations. For example, recent work resets the learner to expert states in order to inform it of high-reward states. However, such an approach is infeasible in the real world. In this work, we consider an alternative approach to speeding up the RL subroutine in IRL: \emph{pessimism}, i.e., staying close to the expert's data distribution, instantiated via the use of offline RL algorithms. We formalize a connection between offline RL and IRL, enabling us to use an arbitrary offline RL algorithm to improve the sample efficiency of IRL. We validate our theory experimentally by demonstrating a strong correlation between the efficacy of an offline RL algorithm and how well it works as part of an IRL procedure. By using a strong offline RL algorithm as part of an IRL procedure, we are able to find policies that match expert performance significantly more efficiently than prior art.