Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model and sometimes even a known expert policy, or they at least require access to a generative model. However, these assumptions are too strong for many real-world applications, where the environment can be accessed only through sequential interaction. We propose a novel IRL algorithm: Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly learn the expert's reward function and identify a good policy. AceIRL uses previous observations to construct confidence intervals that capture plausible reward functions, and to find exploration policies that focus on the most informative regions of the environment. AceIRL is the first approach to active IRL with sample-complexity bounds that does not require a generative model of the environment. In the worst case, it matches the sample complexity of active IRL with a generative model. Additionally, we establish a problem-dependent bound that relates the sample complexity of AceIRL to the suboptimality gap of a given IRL problem. We empirically evaluate AceIRL in simulations and find that it significantly outperforms more naive exploration strategies.
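The idea of using past observations to build confidence intervals and then steering exploration toward the most informative regions can be illustrated with a minimal tabular sketch. This is not AceIRL's actual construction: the Hoeffding-style width, the function names `hoeffding_width` and `most_uncertain_pair`, and the parameters `delta` and `value_range` are assumptions chosen for illustration only. The sketch also stops at naming the most uncertain state-action pair; without a generative model, an agent would still have to plan a policy that reaches that pair through sequential interaction.

```python
import numpy as np


def hoeffding_width(counts, delta=0.05, value_range=1.0):
    """Half-width of a Hoeffding-style confidence interval per (state, action).

    Illustrative assumption: estimates are averages of bounded samples, so the
    width shrinks as 1/sqrt(visit count). Unvisited pairs get the full range.
    """
    counts = np.asarray(counts, dtype=float)
    safe = np.maximum(counts, 1.0)  # avoid division by zero for unvisited pairs
    width = value_range * np.sqrt(np.log(2.0 / delta) / (2.0 * safe))
    width = np.minimum(width, value_range)  # a width beyond the range is vacuous
    return np.where(counts > 0, width, value_range)


def most_uncertain_pair(counts, delta=0.05):
    """Return the (state, action) index whose confidence interval is widest."""
    widths = hoeffding_width(counts, delta)
    state, action = np.unravel_index(int(np.argmax(widths)), widths.shape)
    return int(state), int(action)


if __name__ == "__main__":
    # Hypothetical visit counts for a 2-state, 2-action environment.
    visit_counts = np.array([[10, 0], [3, 50]])
    print(hoeffding_width(visit_counts))
    print(most_uncertain_pair(visit_counts))
```

In this toy setting the never-visited pair has the widest interval, so it is the one an uncertainty-driven explorer would target first, whereas a uniform or random exploration strategy would keep spending samples on pairs that are already well estimated.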