Non-Sequential Graph Script Induction via Multimedia Grounding

Online resources such as WikiHow compile a wide range of scripts for performing everyday tasks, which can help models learn to reason about procedures. However, these scripts are always presented linearly, which does not reflect the flexibility people display when executing tasks in real life. For example, in the CrossTask dataset, 64.5% of consecutive step pairs are also observed in the reverse order, suggesting that their ordering is not fixed. In addition, each step has an average of 2.56 frequent next steps, demonstrating "branching". In this paper, we propose the new and challenging task of non-sequential graph script induction, which aims to capture optional and interchangeable steps in procedural planning. To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks. In particular, we design a multimodal framework that grounds procedural videos to WikiHow textual steps and thus transforms each video into an observed step path on the latent ground-truth graph script. This key transformation enables us to train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence. Our best model outperforms the strongest pure text/vision baselines by 17.52% absolute on F1@3 for next step prediction and by 13.8% absolute on Acc@1 for partial sequence completion. In human evaluation, our model outperforms the WikiHow linear baseline by 48.76% absolute in capturing sequential and non-sequential step relationships.
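
To make the idea of observed step paths concrete, the following is a minimal count-based sketch. It is not the script knowledge model described above. It assumes each video has already been grounded to a sequence of step labels, and every name in it (`transition_counts`, `reversible_fraction`, `induce_graph`, `predict_next`, the `min_count` threshold, and the toy pasta paths) is hypothetical and chosen only for illustration. The reversibility measure is one plausible reading of "consecutive step pairs also observed in the reverse order", not necessarily the definition used to obtain the reported 64.5%.

```python
from collections import Counter, defaultdict


def transition_counts(paths):
    """Count how often step a is immediately followed by step b."""
    counts = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1
    return counts


def reversible_fraction(counts):
    """Fraction of distinct consecutive pairs (a, b) whose reverse (b, a) also occurs."""
    pairs = [(a, b) for (a, b) in counts if a != b]
    if not pairs:
        return 0.0
    return sum((b, a) in counts for a, b in pairs) / len(pairs)


def induce_graph(counts, min_count=2):
    """Keep transitions seen at least min_count times (an assumed threshold).

    Returns {step: [next steps, most frequent first]}.
    """
    graph = defaultdict(list)
    for (a, b), n in sorted(counts.items(), key=lambda kv: -kv[1]):
        if n >= min_count:
            graph[a].append(b)
    return dict(graph)


def predict_next(graph, partial_path, k=3):
    """Return up to k frequent successors of the last observed step."""
    return graph.get(partial_path[-1], [])[:k]


# Toy, made-up step paths standing in for grounded videos of one task.
paths = [
    ["boil water", "add pasta", "add salt", "drain"],
    ["boil water", "add salt", "add pasta", "drain"],
    ["boil water", "add pasta", "add salt", "drain"],
    ["boil water", "add salt", "add pasta", "drain"],
]

counts = transition_counts(paths)
graph = induce_graph(counts, min_count=2)
print(reversible_fraction(counts))
print(predict_next(graph, ["boil water"]))
```

In this toy data, "add pasta" and "add salt" appear in both orders, so the induced graph branches after "boil water" instead of forcing a single linear script.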
