Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Whereas prior work learns step representations locally, our approach learns them globally, using video of the entire surrounding task as context. From these learned representations, we can verify whether an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos: one to verify whether a video contains an anomalous step, and one to verify whether its steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on three existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
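The masked step modeling objective can be illustrated with a minimal sketch. This is not the VideoTaskformer implementation: the function names, the `MASK` placeholder, and the `predict_logits` callable are all hypothetical, and the sketch masks exactly one step per video for simplicity. It only shows the general shape of the idea, namely hiding one step's features, predicting that step's label from the remaining steps of the same task, and scoring the prediction with cross-entropy.

```python
import math
import random

# Hypothetical placeholder standing in for a masked step's video features.
MASK = None


def mask_one_step(step_features, rng):
    """Replace one randomly chosen step with MASK; return the masked copy and its index."""
    idx = rng.randrange(len(step_features))
    masked = list(step_features)  # shallow copy, so the input list is left unchanged
    masked[idx] = MASK
    return masked, idx


def cross_entropy(logits, target):
    """Negative log-probability of `target` under a softmax over `logits`."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def masked_step_loss(step_features, step_labels, predict_logits, rng):
    """Mask one step, predict its label from the surrounding steps, and score it.

    `predict_logits(masked_steps, idx)` is an assumed interface: any model that
    sees the whole masked sequence and returns one score per step label.
    """
    masked, idx = mask_one_step(step_features, rng)
    logits = predict_logits(masked, idx)
    return cross_entropy(logits, step_labels[idx])


if __name__ == "__main__":
    rng = random.Random(0)
    features = [[0.1, 0.4], [0.3, 0.2], [0.9, 0.5]]  # toy per-step features
    labels = [2, 0, 1]                               # toy weak step labels

    def uniform_model(masked_steps, idx):
        # Stand-in model that assigns equal scores to 4 possible step labels.
        return [0.0, 0.0, 0.0, 0.0]

    print(masked_step_loss(features, labels, uniform_model, rng))
```

Because the predictor receives the full masked sequence rather than a single clip, it has access to the rest of the task as context, which is the global (as opposed to local) learning of step representations that the abstract describes.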