Detecting procedural errors online from egocentric videos is a critical yet challenging task in domains such as manufacturing, healthcare, and skill-based training. Such mistakes are inherently open-set: unforeseen or novel errors may occur, so a robust detection system cannot rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual-branch architecture to address this problem: one branch continuously performs step recognition on the input egocentric video, while the other anticipates future steps from the recognition module's output. A mistake is detected as a mismatch between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch leverages the pattern-matching capabilities of Large Language Models (LLMs) to predict the next action tokens from previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties of per-frame evaluation, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets highlight the challenges and opportunities of a dual-branch architecture for mistake detection and demonstrate the effectiveness of the proposed approach. In a thorough evaluation that includes recognition and anticipation variants and state-of-the-art models, our method proves robust and effective in online settings.
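The mismatch rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the recognition branch already emits one label per frame, and it replaces the LLM anticipation branch with a hypothetical lookup of the steps that followed each step in a few correct example runs. All function names, the step labels, and the choice to leave flagged steps out of the history are assumptions made for illustration.

```python
from itertools import groupby
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

Anticipator = Callable[[Sequence[str]], Set[str]]


def frames_to_tokens(frame_labels: Iterable[str]) -> List[str]:
    """Collapse runs of identical per-frame labels into action tokens."""
    return [label for label, _ in groupby(frame_labels)]


def make_lookup_anticipator(correct_runs: Sequence[Sequence[str]]) -> Anticipator:
    """Hypothetical stand-in for the LLM branch (an assumption, not the paper's model).

    It anticipates every step that directly followed the last accepted step
    in at least one of the correct example runs.
    """
    successors: Dict[str, Set[str]] = {}
    for run in correct_runs:
        for prev, nxt in zip(run, run[1:]):
            successors.setdefault(prev, set()).add(nxt)
    first_steps = {run[0] for run in correct_runs if run}

    def anticipate(history: Sequence[str]) -> Set[str]:
        if not history:
            return set(first_steps)
        return set(successors.get(history[-1], set()))

    return anticipate


def detect_mistakes_online(
    frame_labels: Iterable[str], anticipate: Anticipator
) -> List[Tuple[int, str]]:
    """Process frames one at a time and flag steps that were not anticipated.

    Returns (frame_index, label) pairs for the first frame of each flagged step.
    """
    history: List[str] = []  # accepted action tokens so far
    flagged: List[Tuple[int, str]] = []
    current = None  # label of the step currently in progress
    for t, label in enumerate(frame_labels):
        if label == current:
            continue  # same step still in progress, nothing new to check
        current = label
        if label in anticipate(history):
            history.append(label)
        else:
            # Assumption: flagged steps are kept out of the history so that
            # one mistake does not cascade into the following steps.
            flagged.append((t, label))
    return flagged


correct_runs = [["take", "pour", "stir"]]
frames = ["take", "take", "pour", "shake", "shake", "stir"]
anticipator = make_lookup_anticipator(correct_runs)
mistakes = detect_mistakes_online(frames, anticipator)
```

In this toy run the step `shake` is flagged at frame 3 because only `stir` was anticipated after `pour`. The rule never needs examples of failure, which is what makes the detection open-set.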