Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leveraging this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
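
To make the idea of a probabilistic task graph concrete, the following is a minimal sketch and not the method of this work. It assumes the graph can be approximated by first-order transition probabilities between keysteps, estimated by counting transitions in per-segment keystep sequences, and it swaps in standard Viterbi decoding as one plausible way to regularize per-segment predictions. The function names `build_task_graph` and `decode_with_graph`, the smoothing constant, and the toy scores are all hypothetical.

```python
import math


def build_task_graph(sequences, num_keysteps, smoothing=1e-3):
    """Estimate keystep transition probabilities from observed sequences.

    Assumption: a first-order transition matrix stands in for the task graph.
    graph[i][j] approximates P(next keystep = j | current keystep = i).
    """
    counts = [[smoothing] * num_keysteps for _ in range(num_keysteps)]
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def decode_with_graph(scores, graph):
    """Viterbi decoding of per-segment keystep scores under the graph prior.

    scores[t][k] is a strictly positive score (e.g. a probability) that
    segment t shows keystep k. Returns the most likely keystep sequence.
    """
    num_keysteps = len(graph)
    best = [math.log(p) for p in scores[0]]
    backptrs = []
    for frame in scores[1:]:
        new_best, ptrs = [], []
        for k in range(num_keysteps):
            # Best predecessor of keystep k under the transition prior.
            prev = max(range(num_keysteps),
                       key=lambda j: best[j] + math.log(graph[j][k]))
            new_best.append(best[prev] + math.log(graph[prev][k])
                            + math.log(frame[k]))
            ptrs.append(prev)
        best = new_best
        backptrs.append(ptrs)
    # Trace back from the best final keystep.
    path = [max(range(num_keysteps), key=lambda k: best[k])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


# Toy data (hypothetical): three videos that all follow keysteps 0 -> 1 -> 2.
train_sequences = [[0, 0, 1, 1, 2, 2]] * 3
graph = build_task_graph(train_sequences, num_keysteps=3)

# Noisy per-segment scores for a new video with four segments.
scores = [
    [0.70, 0.20, 0.10],
    [0.35, 0.25, 0.40],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
]
independent = [max(range(3), key=lambda k: row[k]) for row in scores]
regularized = decode_with_graph(scores, graph)
```

In this toy case the independent per-segment argmax yields the sequence 0, 2, 1, 2, which includes a 2-to-1 transition never seen in the training sequences, whereas decoding with the graph prior yields 0, 0, 1, 2. In a zero-shot setting the scores might come from video-text similarity, but that is an assumption of this sketch.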