Imitation learning has achieved great success in many sequential decision-making tasks, in which a neural agent is trained by imitating collected human demonstrations. However, existing algorithms typically require a large number of high-quality demonstrations, which are difficult and expensive to collect, so in practice a trade-off must usually be made between demonstration quality and quantity. To address this problem, we consider imitation from sub-optimal demonstrations, given both a small clean demonstration set and a large noisy set. Some pioneering works have been proposed, but they suffer from limitations such as assuming that a demonstration has the same optimality at every time step and failing to provide any interpretation of the knowledge learned from the noisy set. To address these limitations, we propose {\method}, which evaluates and imitates at the sub-demonstration level, encoding action primitives of varying quality into different skills. Concretely, {\method} consists of a high-level controller that discovers skills and a skill-conditioned module that captures action-taking policies. It is trained with a two-phase pipeline: skills are first discovered using all demonstrations, and the controller is then adapted to the clean set only. A mutual-information-based regularization and a dynamic sub-demonstration optimality estimator are designed to promote disentanglement in the skill space. Extensive experiments on two gym environments and a real-world healthcare dataset demonstrate the superiority of {\method} in learning from sub-optimal demonstrations, and an examination of the learned skills shows its improved interpretability.
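The two-level structure described in the abstract, a high-level controller that picks a skill and a skill-conditioned module that picks an action, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class name `TwoLevelPolicy`, the linear scoring functions, the discrete action space, and greedy selection are all hypothetical. Training, the mutual-information regularization, and the optimality estimator are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Return one score per weight row (a plain matrix-vector product)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class TwoLevelPolicy:
    """Hypothetical stand-in: a linear controller plus one linear action head per skill."""

    def __init__(self, state_dim, n_skills, n_actions, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # High-level controller: state -> distribution over skills.
        self.controller = init(n_skills, state_dim)
        # Skill-conditioned module: (state, skill) -> distribution over actions.
        self.heads = [init(n_actions, state_dim) for _ in range(n_skills)]

    def skill_probs(self, state):
        return softmax(linear(self.controller, state))

    def action_probs(self, state, skill):
        return softmax(linear(self.heads[skill], state))

    def act(self, state):
        """Greedily select a skill, then an action conditioned on that skill."""
        p_skill = self.skill_probs(state)
        skill = max(range(len(p_skill)), key=p_skill.__getitem__)
        p_action = self.action_probs(state, skill)
        action = max(range(len(p_action)), key=p_action.__getitem__)
        return skill, action


policy = TwoLevelPolicy(state_dim=3, n_skills=4, n_actions=2)
skill, action = policy.act([0.5, -1.0, 2.0])
```

Under the two-phase pipeline described above, a first phase would fit both `controller` and `heads` on all demonstrations, while a second phase would update only `controller` on the clean set. That mapping onto this sketch is an interpretation of the abstract, not a statement of the authors' implementation.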