Robot behavior policies trained via imitation learning are prone to failure under conditions that deviate from their training data. Thus, algorithms that monitor learned policies at test time and provide early warnings of failure are necessary to facilitate scalable deployment. We propose Sentinel, a runtime monitoring framework that splits the detection of failures into two complementary categories: 1) erratic failures, which we detect using statistical measures of temporal action consistency, and 2) task progression failures, for which we use vision-language models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task. Our approach has two key strengths. First, because learned policies exhibit diverse failure modes, combining complementary detectors leads to significantly higher failure detection accuracy. Second, using a statistical temporal action consistency measure ensures that we quickly detect when multimodal, generative policies exhibit erratic behavior, at negligible computational cost. In contrast, we use VLMs only to detect failure modes that are less time-sensitive. We demonstrate our approach in the context of diffusion policies trained on robotic mobile manipulation domains in both simulation and the real world. By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than either detector alone and significantly outperforms baselines, highlighting the importance of assigning specialized detectors to complementary categories of failure. Qualitative results are available at https://sites.google.com/stanford.edu/sentinel.
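To make the two-detector split concrete, the following is a minimal sketch, not the authors' implementation. It assumes that temporal action consistency can be scored by a statistical distance between the action samples a generative policy draws at consecutive timesteps, and it swaps in a biased RBF-kernel maximum mean discrepancy (MMD) estimate for that distance. The names `mmd_rbf`, `ConsistencyMonitor`, and `failure_alarm`, the cumulative-sum rule, and the threshold value are all hypothetical, and the VLM verdict is represented only as a boolean supplied by the caller.

```python
import numpy as np


def mmd_rbf(x, y, bandwidth=1.0):
    """Biased squared-MMD estimate between two sample sets (rows are samples).

    Assumption: MMD stands in for whatever statistical distance the
    monitor actually uses; the abstract does not name one.
    """
    def kernel(a, b):
        sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


class ConsistencyMonitor:
    """Hypothetical erratic-failure detector based on temporal consistency.

    Accumulates the distance between action samples drawn at consecutive
    timesteps and raises a flag once the running total exceeds a threshold.
    """

    def __init__(self, threshold):
        self.threshold = threshold  # assumed to be calibrated on successful rollouts
        self.cumulative = 0.0
        self.prev_samples = None

    def update(self, samples):
        """Consume one batch of sampled actions, shape (num_samples, action_dim)."""
        if self.prev_samples is not None:
            self.cumulative += mmd_rbf(self.prev_samples, samples)
        self.prev_samples = samples
        return bool(self.cumulative > self.threshold)


def failure_alarm(erratic_flag, vlm_flag):
    """Combine the complementary detectors: either one may raise the alarm."""
    return erratic_flag or vlm_flag
```

In this sketch the consistency check costs only a few small kernel evaluations per timestep, so it could run at every step, while the slower VLM verdict would be refreshed at a lower rate and passed to `failure_alarm` as `vlm_flag`. Comparing entire sample sets is a simplification; a policy that predicts overlapping action chunks would more naturally be compared on the overlapping portion only.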