The Internet hosts a large number of instructional videos that provide tutorials for completing various tasks. Existing instructional video datasets focus only on specific steps at the video level and lack experiential guidelines at the task level, so beginners may struggle to learn new tasks for want of relevant experience. Moreover, specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to daily life. Specifically, we annotate each instructional task with a guideline that represents a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions, and timestamps. Our proposed benchmark consists of three sub-tasks that evaluate the comprehension ability of models: (1) Step Captioning: models must generate captions for specific steps from videos. (2) Guideline Summarization: models must mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models must generate captions for specific steps under the guidance of the guideline. We evaluate a range of foundation models on GUIDE and perform an in-depth analysis. Given the diversity and practicality of GUIDE, we believe it can serve as a better benchmark for instructional video comprehension.
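The annotation hierarchy described in the abstract (a task-level guideline, with each video's specific steps linked to a guideline step and carrying a description and timestamps) can be pictured with a minimal Python sketch. This is not the dataset's actual file format: every class name, field name, and the representation of timestamps as start/end seconds is an assumption made for illustration, and the example task is an invented toy, not an entry from GUIDE.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpecificStep:
    # Hypothetical fields; the released annotation format may differ.
    guideline_step_index: int  # index into Task.guideline
    description: str           # video-specific step description
    start_sec: float           # assumed timestamp representation
    end_sec: float


@dataclass
class Video:
    video_id: str
    steps: List[SpecificStep] = field(default_factory=list)


@dataclass
class Task:
    name: str
    domain: str
    guideline: List[str]       # task-level guideline steps
    videos: List[Video] = field(default_factory=list)


def steps_by_guideline(task: Task) -> Dict[int, List[str]]:
    """Group specific-step descriptions from all videos by guideline step."""
    grouped: Dict[int, List[str]] = {i: [] for i in range(len(task.guideline))}
    for video in task.videos:
        for step in video.steps:
            grouped[step.guideline_step_index].append(step.description)
    return grouped


# Invented toy example (not taken from GUIDE).
example_task = Task(
    name="make tea",
    domain="cooking",
    guideline=["heat the water", "steep the tea"],
    videos=[
        Video("v1", [
            SpecificStep(0, "boil water in a kettle", 0.0, 12.5),
            SpecificStep(1, "steep a tea bag for three minutes", 12.5, 40.0),
        ]),
        Video("v2", [
            SpecificStep(0, "heat water in a saucepan", 3.0, 20.0),
        ]),
    ],
)

grouped = steps_by_guideline(example_task)
```

Grouping specific steps under their guideline step, as `steps_by_guideline` does, mirrors the idea that the guideline is the pattern shared across a task's videos, while the specific steps record how each video realizes it.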