This work tackles a major challenge in offline Inverse Reinforcement Learning (IRL): reward extrapolation error, where the learned reward function may fail to explain the task correctly and may misguide the agent in unseen environments because of intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm, CLARE, that solves offline IRL efficiently by integrating "conservatism" into the learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining subtle two-tier tradeoffs between exploitation (on both expert and diverse data) and exploration (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right exploitation-exploration balance. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and show that the learned reward is highly instructive for further learning.