Standard imitation learning usually assumes that demonstrations are drawn from an optimal policy distribution. In the real world, however, any human demonstration may exhibit nearly random behavior, and collecting high-quality human datasets can be quite costly. Robots must therefore be able to learn from imperfect demonstrations and thereby acquire behavioral policies that align with human intent. Prior work uses confidence scores to extract useful information from imperfect demonstrations, but this relies on access to ground-truth rewards or active human supervision. In this paper, we propose a dynamics-based method that obtains fine-grained confidence scores for data without either requirement. We develop a generalized confidence-based imitation learning framework called Confidence-based Inverse soft-Q Learning (CIQL), which can employ different policy learning methods by changing the objective function. Experimental results show that our confidence evaluation method can increase the success rate of the original algorithm by $40.3\%$, which is $13.5\%$ higher than simply filtering out noise.