Offline imitation learning enables learning a policy solely from a set of expert demonstrations, without any environment interaction. To alleviate the distribution shift that arises from the small amount of expert data, recent works incorporate large numbers of auxiliary demonstrations alongside the expert data. However, the performance of these approaches relies on assumptions about the quality and composition of the auxiliary data, and they are rarely successful when those assumptions do not hold. To address this limitation, we propose Robust Offline Imitation from Diverse Auxiliary Data (ROIDA). ROIDA first identifies high-quality transitions in the entire auxiliary dataset using a learned reward function. These high-reward samples are combined with the expert demonstrations for weighted behavioral cloning. For lower-quality samples, ROIDA applies temporal difference learning to steer the policy toward high-reward states, improving long-term returns. This two-pronged approach enables our framework to effectively leverage both high- and low-quality data without assumptions about the auxiliary data. Extensive experiments validate that ROIDA achieves robust and consistent performance across multiple auxiliary datasets with diverse ratios of expert and non-expert demonstrations. ROIDA effectively leverages unlabeled auxiliary data, outperforming prior methods that rely on specific data assumptions.
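To make the two-pronged objective concrete, below is a minimal PyTorch-style sketch of one training step. It assumes a learned reward model `reward_fn`, a policy `pi` exposing `log_prob` and `sample`, Q-networks `q_net` and `q_target`, and a weighting threshold `tau`; all of these names, interfaces, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def roida_loss(pi, q_net, q_target, reward_fn, expert_batch, aux_batch,
               tau=0.5, gamma=0.99, bc_coef=1.0, td_coef=1.0):
    # 1) Score auxiliary transitions with the learned reward function.
    with torch.no_grad():
        r = reward_fn(aux_batch.obs, aux_batch.act).squeeze(-1)
        w = torch.sigmoid(r)   # per-sample weights in [0, 1] (assumed form)
        high = w > tau         # high-quality subset of the auxiliary data

    # 2) Weighted behavioral cloning on expert data plus the
    #    high-reward auxiliary samples.
    logp_expert = pi.log_prob(expert_batch.obs, expert_batch.act)
    logp_high = pi.log_prob(aux_batch.obs[high], aux_batch.act[high])
    bc_loss = -(logp_expert.mean() + (w[high] * logp_high).mean())

    # 3) Temporal difference learning over all auxiliary transitions
    #    steers the policy toward high-reward states, improving
    #    long-term returns even from lower-quality samples.
    with torch.no_grad():
        next_act = pi.sample(aux_batch.next_obs)
        target = r + gamma * q_target(aux_batch.next_obs, next_act).squeeze(-1)
    td_loss = F.mse_loss(q_net(aux_batch.obs, aux_batch.act).squeeze(-1), target)

    return bc_coef * bc_loss + td_coef * td_loss
```

The key design point the sketch illustrates is that no sample is discarded: high-weight transitions contribute imitation signal directly, while the remainder still shape the value function through the TD term.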