Imitation Learning (IL) is a promising paradigm for teaching robots to perform novel tasks using demonstrations. Most existing approaches to IL rely on neural networks (NNs); however, these methods suffer from several well-known limitations: they (1) require large amounts of training data, (2) are hard to interpret, and (3) are hard to repair and adapt. There is an emerging interest in programmatic imitation learning (PIL), which offers significant promise in addressing the above limitations. In PIL, the learned policy is represented in a programming language, making it amenable to interpretation and repair. However, state-of-the-art PIL algorithms assume access to action labels and struggle to learn from noisy real-world demonstrations. In this paper, we propose PLUNDER, a novel PIL algorithm that integrates a probabilistic program synthesizer into an iterative Expectation-Maximization (EM) framework to address these shortcomings. Unlike existing PIL approaches, PLUNDER synthesizes probabilistic programmatic policies that are particularly well-suited for modeling the uncertainties inherent in real-world demonstrations. Our approach leverages an EM loop to simultaneously infer the missing action labels and the most likely probabilistic policy. We benchmark PLUNDER against several established IL techniques and demonstrate its superiority across five challenging imitation learning tasks under noise. PLUNDER policies achieve 95% accuracy in matching the given demonstrations, outperforming the next best baseline by 19%. Additionally, policies generated by PLUNDER successfully complete the tasks 17% more frequently than the nearest baseline.
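The EM loop described above can be sketched on a toy problem. The following is a minimal, hypothetical illustration (not the PLUNDER implementation): states are 1-D distances along a track, the latent action labels are GO and STOP, the demonstration records only noisy velocities, and the "program" being synthesized is a single threshold drawn from a small candidate set. The E-step infers the most likely label at each timestep by combining the observation model with the current policy's prediction; the M-step picks the candidate program that best explains the inferred labels, standing in for a full probabilistic program synthesizer.

```python
import math

# Nominal velocity implied by each latent action label (hypothetical numbers).
NOMINAL = {"GO": 1.0, "STOP": 0.0}

def label_likelihood(vel, label, sigma=0.3):
    """Gaussian likelihood of an observed velocity under a latent label."""
    return math.exp(-((vel - NOMINAL[label]) ** 2) / (2 * sigma ** 2))

def e_step(states, vels, threshold, eps=0.1):
    """Infer the most likely latent label per timestep, weighting the
    observation model by the current policy's (noisy) prediction."""
    labels = []
    for s, v in zip(states, vels):
        predicted = "GO" if s > threshold else "STOP"
        def posterior(lab):
            prior = (1 - eps) if lab == predicted else eps
            return prior * label_likelihood(v, lab)
        labels.append(max(NOMINAL, key=posterior))
    return labels

def m_step(states, labels, candidates):
    """Pick the candidate threshold whose program best matches the
    inferred labels (a stand-in for probabilistic program synthesis)."""
    def agreement(t):
        return sum(("GO" if s > t else "STOP") == lab
                   for s, lab in zip(states, labels))
    return max(candidates, key=agreement)

def em_loop(states, vels, candidates, iters=10):
    """Alternate label inference (E) and program selection (M)."""
    threshold = candidates[0]
    for _ in range(iters):
        labels = e_step(states, vels, threshold)
        threshold = m_step(states, labels, candidates)
    return threshold, e_step(states, vels, threshold)

# A noisy demonstration whose true switching point is at state 2.0:
states = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
vels = [0.1, -0.05, 0.2, 0.9, 1.1, 0.95]
threshold, labels = em_loop(states, vels, [0.0, 1.0, 2.0, 3.0])
# Recovers the threshold 2.0 and labels the first three steps STOP, the rest GO.
```

Even from a poor initial program (threshold 0.0), one E/M round relabels the timesteps using the velocity evidence and snaps the program to the correct threshold; the real algorithm replaces the candidate grid with a program synthesizer over a richer policy grammar.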