We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Unlike prior work, we develop techniques to train a lightweight temporal module that uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues such as optical flow, depth, or gaze to learn discriminative features for key steps, making it amenable to AR applications. Finally, we extract key steps via a tunable algorithm that clusters the representations and samples from the clusters. We show significant improvements over prior work on the tasks of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent the various steps of the procedural tasks.