As robotic systems execute increasingly difficult task sequences, the number of ways in which they can fail grows as well. Video Anomaly Detection (VAD) frameworks typically focus on singular, low-level kinematic or action failures and struggle to identify more complex temporal or spatial task violations, because these do not necessarily manifest as low-level execution errors. To address this problem, the main contribution of this paper is a new VAD-inspired architecture, TIMID, which detects time-dependent mistakes made by robots executing high-level tasks. Our architecture takes as input a video together with prompts describing the task and the potential mistake, and returns a frame-level prediction of whether the mistake is present in the video. By adopting a VAD formulation, the model can be trained with weak supervision, requiring only a single label per video. Additionally, to alleviate the scarcity of data on incorrect executions, we introduce a multi-robot simulation dataset with controlled temporal errors, along with real executions for zero-shot sim-to-real evaluation. Our experiments demonstrate that out-of-the-box VLMs lack the explicit temporal reasoning required for this task, whereas our framework successfully detects different types of temporal errors. Project: https://ropertunizar.github.io/TIMID/