Automatically generating scripts (i.e., sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial for modern AI virtual assistants that guide humans through everyday tasks, especially unfamiliar ones. However, current methods for generative script learning either rely heavily on well-structured preceding steps described in text and/or images or are limited to a single domain, which leaves a gap between these methods and real-world user scenarios. To address these limitations, we present a new benchmark challenge, MultiScript, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done so far to complete the target task. The expected output is (1) a sequence of structured step descriptions in text based on the demonstration video and (2) a single text description of the subsequent step, respectively. Built from WikiHow, MultiScript covers multimodal scripts in videos and text descriptions for over 6,655 everyday human tasks across 19 diverse domains. To establish baseline performance on MultiScript, we propose two knowledge-guided multimodal generative frameworks that incorporate task-related knowledge obtained by prompting large language models such as Vicuna. Experimental results show that our proposed approaches significantly improve over competitive baselines.