To make progress towards multi-modal AI assistants that can guide users through complex multi-step goals, we propose the task of Visual Planning for Assistance (VPA). Given a goal briefly described in natural language, e.g., "make a shelf", and a video of the user's progress so far, the aim of VPA is to obtain a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf", etc., to achieve the goal. This requires assessing the user's progress from the untrimmed video and relating it to the requirements of the underlying goal, i.e., the relevance of actions and the ordering dependencies amongst them. Consequently, this requires handling long video histories and arbitrarily complex action dependencies. To address these challenges, we decompose VPA into video action segmentation and forecasting. We formulate the forecasting step as a multi-modal sequence modeling problem and present Visual Language Model based Planner (VLaMP), which leverages pre-trained LMs as the sequence model. We demonstrate that VLaMP performs significantly better than baselines w.r.t. all metrics that evaluate the generated plan. Moreover, through extensive ablations, we isolate the value of language pre-training, visual observations, and goal information on the performance. We will release our data, model, and code to enable future research on visual planning for assistance.
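The decomposition of VPA into segmentation followed by forecasting can be illustrated with a minimal sketch. This is not the VLaMP implementation: the names `Segment`, `plan_for_assistance`, `toy_segmenter`, `toy_forecaster`, and `TOY_RECIPES` are all hypothetical, the "video" is assumed to be a list of per-frame action labels, and the forecaster is a simple lookup table standing in for the pre-trained LM used as the sequence model. The steps "cut wood" and "assemble shelf" are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Segment:
    """One action segment recovered from the untrimmed video (hypothetical type)."""
    action: str  # action label, e.g. "sand shelf"
    start: int   # first frame index (inclusive)
    end: int     # last frame index (exclusive)


Segmenter = Callable[[Sequence[str]], List[Segment]]
Forecaster = Callable[[str, List[Segment], int], List[str]]


def plan_for_assistance(goal: str, frames: Sequence[str],
                        segmenter: Segmenter, forecaster: Forecaster,
                        horizon: int) -> List[str]:
    """Two-stage pipeline: segment the history, then forecast future actions."""
    history = segmenter(frames)                 # stage 1: video action segmentation
    return forecaster(goal, history, horizon)   # stage 2: forecasting the plan


BACKGROUND = "background"  # assumed label for frames showing no relevant action

# Hypothetical goal-to-steps table; a stand-in for a learned sequence model.
TOY_RECIPES: Dict[str, List[str]] = {
    "make a shelf": ["cut wood", "assemble shelf", "sand shelf", "paint shelf"],
}


def toy_segmenter(frame_labels: Sequence[str]) -> List[Segment]:
    """Merge runs of identical non-background frame labels into segments."""
    segments: List[Segment] = []
    for i, label in enumerate(frame_labels):
        if label == BACKGROUND:
            continue
        if segments and segments[-1].action == label and segments[-1].end == i:
            segments[-1].end = i + 1  # extend the current run by one frame
        else:
            segments.append(Segment(label, i, i + 1))
    return segments


def toy_forecaster(goal: str, history: List[Segment], horizon: int) -> List[str]:
    """Return up to `horizon` steps of the goal that are absent from the history."""
    done = {seg.action for seg in history}
    remaining = [step for step in TOY_RECIPES.get(goal, []) if step not in done]
    return remaining[:horizon]


frames = ["background", "cut wood", "cut wood", "background", "assemble shelf"]
plan = plan_for_assistance("make a shelf", frames,
                           toy_segmenter, toy_forecaster, horizon=2)
print(plan)
```

Keeping the two stages behind separate callables mirrors the decomposition described above: either stage could be replaced, for instance by a learned segmentation model or an LM-based forecaster, without changing the other.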