Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an \emph{expert policy} -- plays a critical role in developing intelligent systems, such as those that understand and imitate human behavior. Although IRL is widely used in applications, its theoretical understanding presents unique challenges and remains less developed than standard RL theory. For example, it remains open how to do IRL efficiently in standard \emph{offline} settings with pre-collected data, where states are obtained from a \emph{behavior policy} (which could be the expert policy itself) and actions are sampled from the expert policy. This paper provides the first line of results for efficient IRL in vanilla offline and online settings, using polynomially many samples and polynomial runtime. We first design a new IRL algorithm for the offline setting, Reward Learning with Pessimism (RLP), and show that it achieves sample complexity polynomial in the size of the MDP, a concentrability coefficient between the behavior policy and the expert policy, and the desired accuracy. Building on RLP, we further design an algorithm, Reward Learning with Exploration (RLE), which operates in a natural online setting where the learner can both actively explore the environment and query the expert policy, and which obtains a stronger notion of IRL guarantee from polynomially many samples. We establish sample complexity lower bounds for both settings, showing that RLP and RLE are nearly optimal. Finally, as an application, we show that the learned reward functions can \emph{transfer} to another target MDP with suitable guarantees when the target MDP satisfies certain similarity assumptions with respect to the original (source) MDP.