Goal-oriented generative script learning aims to generate the subsequent steps needed to reach a particular goal, an essential task for assisting robots or humans in performing stereotypical activities. An important aspect of this process is the ability to capture historical states visually, which provides detailed information that is not covered by text and guides subsequent steps. Therefore, we propose a new task, Multimedia Generative Script Learning, which generates subsequent steps by tracking historical states in both the text and vision modalities, and we present the first benchmark for it, containing 5,652 tasks and 79,089 multimedia steps. This task is challenging in three respects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps. To address the multimedia challenge, we encode visual state changes through a selective multimedia encoder; to overcome the induction challenge, we transfer knowledge from previously observed tasks using a retrieval-augmented decoder; and to address the diversity challenge, we present distinct information at each step by optimizing a diversity-oriented contrastive learning objective. We define metrics to evaluate both generation and inductive quality. Experimental results demonstrate that our approach significantly outperforms strong baselines.