In offline Imitation Learning (IL), one of the main challenges is the \textit{covariate shift} between the expert observations and the state distribution the agent actually encounters: it is difficult to determine what action an agent should take when it is outside the state distribution of the expert demonstrations. Recent model-free solutions introduce supplementary data and identify latent expert-similar samples to enlarge the set of reliable samples during learning. Model-based solutions build forward dynamics models with conservatism quantification and then generate additional trajectories in the neighborhood of expert demonstrations. However, without reward supervision, these methods are often over-conservative in out-of-expert-support regions, because a preferred action that enables policy optimization exists only in states close to expert-observed states. To encourage more exploration of expert-unobserved states, we propose a novel model-based framework, offline Imitation Learning with Self-paced Reverse Augmentation (SRA). Specifically, we build a reverse dynamics model from the offline demonstrations, which efficiently generates trajectories leading to expert-observed states in a self-paced manner. We then apply a reinforcement learning method to the augmented trajectories so that the policy learns to transition from expert-unobserved states to expert-observed states. This framework not only explores expert-unobserved states but also guides the maximization of long-term returns on these states, ultimately enabling generalization beyond the expert data. Empirical results show that our proposal effectively mitigates covariate shift and achieves state-of-the-art performance on offline imitation learning benchmarks. Project website: \url{https://www.lamda.nju.edu.cn/shaojj/KDD24_SRA/}.
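The reverse-augmentation idea can be illustrated with a small sketch. The code below is not the SRA implementation: it assumes a toy 1-D chain whose forward dynamics are s' = s + a, and it replaces the learned reverse dynamics model with a hypothetical hand-written function `toy_reverse_model`. The names `reverse_rollout` and `augment` and the fixed `horizon` are likewise illustrative assumptions; the self-paced schedule is not specified in this abstract and is therefore omitted.

```python
import random

# Fixed seed so the toy example is reproducible (illustrative choice).
_rng = random.Random(0)


def toy_reverse_model(next_state):
    """Hypothetical stand-in for a learned reverse dynamics model.

    Assumed toy forward dynamics: s' = s + a with a in {-1, +1}.
    Given s', sample an action and return the predecessor state s.
    """
    action = _rng.choice([-1, 1])
    return next_state - action, action


def reverse_rollout(reverse_model, target_state, horizon):
    """Step backward from target_state and return forward-ordered transitions.

    Each transition is (state, action, next_state); the last transition
    always ends in target_state.
    """
    transitions = []
    next_state = target_state
    for _ in range(horizon):
        prev_state, action = reverse_model(next_state)
        transitions.append((prev_state, action, next_state))
        next_state = prev_state
    transitions.reverse()  # generated backward, so restore forward time order
    return transitions


def augment(expert_states, horizon, rollouts_per_state=2):
    """Collect reverse rollouts that lead into each expert-observed state."""
    data = []
    for state in expert_states:
        for _ in range(rollouts_per_state):
            data.extend(reverse_rollout(toy_reverse_model, state, horizon))
    return data
```

Under these assumptions, `augment([0, 5], horizon=3)` returns transitions that start in states the expert may never have visited and end in the expert-observed states 0 or 5; a downstream learner could then use such data to learn how to return to expert support.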