Supervised imitation learning, also known as behavioral cloning, suffers from distribution drift, which leads to failures during policy execution. One approach to mitigating this issue is to allow an expert to correct the agent's actions during task execution, based on the expert's determination that the agent has reached a "point of no return." The agent's policy is then retrained using this new corrective data. This approach alone can produce high-performance agents, but at a substantial cost: the expert must vigilantly observe execution until the policy reaches a specified level of success, and even at that point, there is no guarantee that the policy will always succeed. To address these limitations, we present FIRE (Failure Identification to Reduce Expert Burden in intervention-based learning), a system that can predict when a running policy will fail, halt its execution, and request a correction from the expert. Unlike existing approaches that learn only from expert data, our approach learns from both expert and non-expert data, akin to adversarial learning. We demonstrate experimentally on a series of challenging manipulation tasks that our method is able to recognize state-action pairs that lead to failures. This permits seamless integration into an intervention-based learning system, where we show an order-of-magnitude gain in sample efficiency compared with a state-of-the-art inverse reinforcement learning method and dramatically improved performance over behavioral cloning trained on an equivalent amount of data.
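The predict, halt, and request-correction loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_with_failure_gate`, the 1-D toy environment, the hand-written `toy_failure_score` (standing in for a learned failure predictor), and the 0.5 threshold are all hypothetical names and values chosen for illustration only.

```python
from typing import Callable, Optional, Tuple

State = float
Action = float
Correction = Tuple[State, Action]


def run_with_failure_gate(
    policy: Callable[[State], Action],
    failure_score: Callable[[State, Action], float],
    expert: Callable[[State], Action],
    step: Callable[[State, Action], State],
    state: State,
    horizon: int = 20,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Tuple[State, Optional[Correction]]:
    """Roll out `policy`; halt and query the expert when failure is predicted.

    Returns the state at which the rollout ended and, if the rollout was
    halted, the (state, expert_action) pair to add to the training data.
    """
    for _ in range(horizon):
        action = policy(state)
        # Score the proposed state-action pair before executing it.
        if failure_score(state, action) > threshold:
            # Predicted failure: stop here and record the expert's correction.
            return state, (state, expert(state))
        state = step(state, action)
    return state, None


# --- Toy 1-D task (purely illustrative): drive the state toward 0. ---
def toy_policy(x: State) -> Action:
    return 1.0  # always moves right, which is wrong once x >= 0


def toy_failure_score(x: State, a: Action) -> float:
    # Hand-written stand-in for a learned predictor:
    # flag any action that moves the state away from the goal at 0.
    return 1.0 if abs(x + a) > abs(x) else 0.0


def toy_expert(x: State) -> Action:
    return -1.0 if x > 0 else (1.0 if x < 0 else 0.0)


def toy_step(x: State, a: Action) -> State:
    return x + a


ok_state, ok_corr = run_with_failure_gate(
    toy_policy, toy_failure_score, toy_expert, toy_step, state=-2.0, horizon=2
)
bad_state, bad_corr = run_with_failure_gate(
    toy_policy, toy_failure_score, toy_expert, toy_step, state=1.0
)
```

In this sketch the expert is consulted only when the predictor fires, so the expert does not need to watch every rollout. In an intervention-based learning system, the returned corrections would then be added to the dataset used to retrain the policy.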