We address the challenge of unsupervised mistake detection in egocentric video of skilled human activities through the analysis of gaze signals. While traditional methods rely on manually labeled mistakes, our approach does not require mistake annotations, hence overcoming the need for domain-specific labeled data. Based on the observation that eye movements closely follow object manipulation activities, we assess to what extent eye-gaze signals can support mistake detection, proposing to identify deviations between attention patterns measured by a gaze tracker and those estimated by a gaze prediction model. Since gaze prediction in video is characterized by high uncertainty, we propose a novel gaze completion task, in which eye fixations are predicted from visual observations and partial gaze trajectories, and contribute a novel gaze completion approach that explicitly models correlations between gaze information and local visual tokens. Inconsistencies between predicted and observed gaze trajectories act as an indicator to identify mistakes. Experiments highlight the effectiveness of the proposed approach in different settings, with relative gains of up to +14%, +11%, and +5% on EPIC-Tent, HoloAssist, and IndustReal, respectively, remarkably matching the results of supervised approaches without using any labels. We further show that gaze-based analysis is particularly useful in the presence of skilled actions, low action execution confidence, and actions requiring hand-eye coordination and object manipulation skills. Our method is ranked first on the HoloAssist Mistake Detection challenge.
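The core scoring idea above, using the deviation between observed and model-predicted gaze as a mistake indicator, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the L2 distance, and the fixed threshold are assumptions, and the real method obtains predictions from the proposed gaze completion model.

```python
import numpy as np

def mistake_scores(observed, predicted):
    """Per-frame deviation (L2 distance) between observed gaze points,
    e.g. from a gaze tracker, and gaze predicted by a model.
    Both inputs are arrays of shape (T, 2) in normalized coordinates."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.linalg.norm(observed - predicted, axis=-1)

def flag_mistakes(scores, threshold):
    """Mark frames whose gaze deviation exceeds a threshold as candidate mistakes.
    The threshold is illustrative; in practice it would be tuned or derived
    from the score distribution."""
    return scores > threshold

# Illustrative example: the second frame deviates strongly from the prediction.
observed = [[0.50, 0.50], [0.90, 0.10]]
predicted = [[0.50, 0.50], [0.50, 0.50]]
scores = mistake_scores(observed, predicted)
flags = flag_mistakes(scores, threshold=0.3)
```

In the paper's setting, higher deviation between the gaze trajectory completed by the model and the trajectory actually recorded signals a likely mistake in the ongoing activity.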