Visual imitation learning provides an effective framework for learning skills from demonstrations. However, the quality of the demonstrations strongly affects an agent's ability to acquire the desired skills. Standard visual imitation learning therefore assumes near-optimal demonstrations, which are expensive or sometimes prohibitive to collect. Previous works propose learning from noisy demonstrations; however, the noise is usually assumed to follow a context-independent distribution, such as a uniform or Gaussian distribution. In this paper, we consider another crucial yet underexplored setting: imitation learning with task-irrelevant yet locally consistent segments in the demonstrations (e.g., wiping sweat while cutting potatoes in a cooking tutorial). We argue that such noise is common in real-world data and term these segments "extraneous" segments. To tackle this problem, we introduce Extraneousness-Aware Imitation Learning (EIL), a self-supervised approach that learns visuomotor policies from third-person demonstrations with extraneous subsequences. EIL learns action-conditioned observation embeddings in a self-supervised manner and retrieves task-relevant observations across visual demonstrations while excluding the extraneous ones. Experimental results show that EIL outperforms strong baselines and learns policies comparable to those trained with perfect demonstrations on both simulated and real-world robot control tasks. The project page can be found at https://sites.google.com/view/eil-website.