Procedure-Aware Pretraining for Instructional Video Understanding

Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Because annotations are scarce, a key challenge in procedure understanding is extracting procedural knowledge from unlabeled videos: the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), and the likely next steps given partial progress in its execution. Our main insight is that instructional videos depict sequences of steps that recur across instances of the same task and across different tasks, and that this structure can be well represented by a Procedural Knowledge Graph (PKG), where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels for training a video representation that encodes procedural knowledge in a more accessible form and generalizes to multiple procedure understanding tasks. We build a PKG by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus, and then use it to generate training pseudo labels with four novel pre-training objectives. We call this PKG-based pre-training procedure and the resulting model Paprika, Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask on procedure understanding tasks such as task recognition, step recognition, and step forecasting. Paprika yields a video representation that improves over the state of the art, with accuracy gains of up to 11.23% across 12 evaluation settings. The implementation is available at https://github.com/salesforce/paprika.
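
To make the graph structure concrete, the following is a minimal sketch of a step graph in which nodes are discrete steps and directed edges connect steps that occur consecutively. It is an illustration only, not the Paprika implementation: the function names `build_step_graph` and `next_steps`, the edge counts, and the toy step sequences are all assumptions introduced here, and the sketch takes already-labeled step sequences as input rather than deriving them from a text database and unlabeled videos.

```python
from collections import defaultdict


def build_step_graph(step_sequences):
    """Build a toy directed step graph (hypothetical helper, not from the paper).

    Nodes are step names. An edge (a, b) records how many times step b
    directly follows step a across all input sequences.
    """
    nodes = set()
    edges = defaultdict(int)
    for seq in step_sequences:
        nodes.update(seq)
        # Pair each step with its immediate successor in the same sequence.
        for a, b in zip(seq, seq[1:]):
            edges[(a, b)] += 1
    return nodes, dict(edges)


def next_steps(edges, step):
    """Return candidate next steps after `step`, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = [(b, n) for (a, b), n in edges.items() if a == step]
    return [b for b, _ in sorted(candidates, key=lambda x: (-x[1], x[0]))]


# Toy, made-up step sequences; the step names are illustrative assumptions.
sequences = [
    ["grind beans", "brew espresso", "steam milk", "pour milk"],  # e.g. 'make latte'
    ["grind beans", "brew espresso", "add hot water"],            # a different task
    ["steam milk", "pour milk"],                                  # a partial instance
]
nodes, edges = build_step_graph(sequences)
candidates_after_espresso = next_steps(edges, "brew espresso")
```

In this toy graph the step 'brew espresso' is shared by two tasks and has two outgoing edges, which shows how steps repeated across tasks give the graph a notion of plausible next steps from partial progress.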
