In this paper, we address unsupervised mistake detection in egocentric procedural video through the analysis of gaze signals. Traditional supervised mistake detection methods rely on manually labeled mistakes and hence suffer from domain dependence and limited scalability. We introduce an unsupervised method for detecting mistakes in videos of human activities that requires neither domain-specific supervision nor annotated mistakes. We postulate that, when a subject makes a mistake while executing a procedure, their attention patterns deviate from normal patterns. We therefore propose to detect mistakes by comparing gaze trajectories predicted from the input video with ground-truth gaze signals collected through a gaze tracker. Since predicting gaze in video is characterized by high uncertainty, we propose a novel \textit{gaze completion task}, which aims to predict gaze from visual observations and partial gaze trajectories. We further contribute a \textit{gaze completion approach} based on a Gaze-Frame Correlation module that explicitly models the correlation between gaze information and each local visual token. Inconsistencies between the predicted and observed gaze trajectories serve as the indicator of a mistake. Experiments on the EPIC-Tent, HoloAssist and IndustReal datasets show the effectiveness of the proposed approach compared with unsupervised and one-class techniques. Our method ranks first in the HoloAssist Mistake Detection challenge.
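The detection principle stated above, comparing a predicted gaze trajectory with the observed one and treating inconsistency as a mistake indicator, can be illustrated with a minimal sketch. The scoring function here is an assumption for illustration only: the abstract does not specify how the inconsistency is measured, so the sketch uses a smoothed per-frame Euclidean distance. The names `gaze_mistake_scores` and `flag_mistakes`, the `window` parameter, and the threshold value are hypothetical, and the predicted trajectory is a placeholder array rather than the output of the gaze completion model.

```python
import numpy as np


def gaze_mistake_scores(predicted, observed, window=5):
    """Per-frame inconsistency between predicted and observed gaze.

    predicted, observed: arrays of shape (T, 2) holding normalized (x, y)
    gaze coordinates. `window` is a hypothetical smoothing length, not a
    value taken from the paper.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if (predicted.shape != observed.shape
            or predicted.ndim != 2 or predicted.shape[1] != 2):
        raise ValueError("expected two arrays of shape (T, 2)")

    # Euclidean distance between the two gaze points in every frame.
    dist = np.linalg.norm(predicted - observed, axis=1)

    # Moving average with edge padding so the output keeps length T.
    pad_left = window // 2
    pad_right = window - 1 - pad_left
    padded = np.pad(dist, (pad_left, pad_right), mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")


def flag_mistakes(scores, threshold):
    """Mark frames whose score exceeds an (assumed) fixed threshold."""
    return np.asarray(scores) > threshold


if __name__ == "__main__":
    # Placeholder trajectories: the model output is simulated by zeros,
    # and the observed gaze drifts away during the last four frames.
    predicted = np.zeros((10, 2))
    observed = np.zeros((10, 2))
    observed[6:] = [0.3, 0.4]

    scores = gaze_mistake_scores(predicted, observed, window=3)
    print(flag_mistakes(scores, threshold=0.25))
```

In practice the threshold would have to be chosen on held-out data or replaced by a threshold-free metric; the fixed value above is only there to make the sketch runnable.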