Offline Imitation Learning (IL) with imperfect demonstrations has garnered increasing attention owing to the scarcity of expert data in many real-world domains. A fundamental problem in this scenario is how to extract positive behaviors from noisy data. In general, current approaches select data based on state-action similarity to the given expert demonstrations, neglecting valuable information in (potentially abundant) $\textit{diverse}$ state-actions that deviate from the expert ones. In this paper, we introduce a simple yet effective data selection method that identifies positive behaviors based on their resultant states -- a more informative criterion that enables explicit use of dynamics information and effective extraction of both expert and beneficial diverse behaviors. Further, we devise a lightweight behavior cloning algorithm capable of leveraging the expert and selected data correctly. In experiments, we evaluate our method on a suite of complex and high-dimensional offline IL benchmarks, including continuous-control and vision-based tasks. The results demonstrate that our method achieves state-of-the-art performance, outperforming existing methods on $\textbf{20/21}$ benchmarks, typically by $\textbf{2-5x}$, while maintaining a runtime comparable to Behavior Cloning ($\texttt{BC}$).
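To make the idea of selecting by resultant state concrete, the following is a minimal illustrative sketch, not the paper's algorithm. The abstract does not specify the selection criterion, so the sketch assumes a simple stand-in: a transition is kept when its next state lies within a Euclidean distance threshold of some expert state. The function name `select_by_resultant_state`, the `threshold` parameter, and the toy data are all hypothetical.

```python
import math


def select_by_resultant_state(transitions, expert_states, threshold):
    """Keep (state, action, next_state) tuples whose next_state lies within
    `threshold` (Euclidean distance) of at least one expert state.

    Assumption: nearest-expert-state distance is only a stand-in criterion
    chosen for illustration; the abstract does not define the actual rule.
    """
    selected = []
    for state, action, next_state in transitions:
        nearest = min(math.dist(next_state, e) for e in expert_states)
        if nearest <= threshold:
            selected.append((state, action, next_state))
    return selected


# Toy 2-D example (hypothetical data).
expert_states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
noisy_transitions = [
    ((0.0, 1.0), "down", (0.0, 0.1)),  # starts off-expert, lands near an expert state
    ((1.0, 0.0), "up", (1.0, 3.0)),    # starts at an expert state, drifts away
    ((5.0, 5.0), "left", (4.0, 5.0)),  # far from expert states throughout
]
selected = select_by_resultant_state(noisy_transitions, expert_states, threshold=0.5)
```

In this toy setting only the first transition is kept, even though its state-action pair does not match any expert state. That is the kind of diverse but beneficial behavior that a purely state-action similarity filter would discard. The selected transitions could then be combined with the expert data for behavior cloning.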