Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Goal-oriented planning, or anticipating a series of actions that transition an agent from its current state to a predefined objective, is crucial for developing intelligent assistants aiding users in daily procedural tasks. The problem presents significant challenges due to the need for comprehensive knowledge of temporal and hierarchical task structures, as well as strong capabilities in reasoning and planning. To achieve this, prior work typically relies on extensive training on the target dataset, which often results in significant dataset bias and a lack of generalization to unseen tasks. In this work, we introduce VidAssist, an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos. VidAssist leverages large language models (LLMs) as both the knowledge base and the assessment tool for generating and evaluating action plans, thus overcoming the challenges of acquiring procedural knowledge from small-scale, low-diversity datasets. Moreover, VidAssist employs a breadth-first search algorithm for optimal plan generation, in which a composite of value functions designed for goal-oriented planning is utilized to assess the predicted actions at each step. Extensive experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups, e.g., visual planning for assistance (VPA) and procedural planning (PP), and achieves remarkable performance in zero-shot and few-shot setups. Specifically, when predicting 4 future actions on the COIN dataset, our few-shot model outperforms the prior fully supervised state-of-the-art method by +7.7% on the VPA task and +4.81% on the PP task. Code and models are publicly available at https://sites.google.com/view/vidassist.
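To make the propose, assess, search loop concrete, the following is a minimal sketch and not the released VidAssist implementation. It assumes a `propose` callable standing in for the LLM that suggests next actions, and a list of value functions whose sum scores each proposed action. All names here (`breadth_first_plan`, `toy_propose`, `toy_order_value`, `toy_repeat_penalty`, `RECIPE`, `beam_width`) are hypothetical, and pruning each level to the top `beam_width` partial plans is an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[str, ...]


def breadth_first_plan(
    propose: Callable[[Plan], Sequence[str]],
    value_fns: Sequence[Callable[[Plan, str], float]],
    horizon: int,
    beam_width: int = 2,
) -> Plan:
    """Expand partial plans level by level, keeping the best `beam_width` per level."""
    frontier: List[Tuple[float, Plan]] = [(0.0, ())]
    for _ in range(horizon):
        candidates: List[Tuple[float, Plan]] = []
        for score, plan in frontier:
            for action in propose(plan):
                # Composite value: sum of all value functions for this step.
                step_value = sum(fn(plan, action) for fn in value_fns)
                candidates.append((score + step_value, plan + (action,)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        frontier = candidates[:beam_width]
    # Return the highest-scoring plan on the final level.
    return max(frontier, key=lambda item: item[0])[1]


# Toy stand-ins (assumptions): a real system would query an LLM instead.
RECIPE: Plan = ("boil water", "grind beans", "pour water", "serve coffee")


def toy_propose(plan: Plan) -> Sequence[str]:
    """Pretend proposer: always suggests every known action."""
    return list(RECIPE)


def toy_order_value(plan: Plan, action: str) -> float:
    """Reward an action that matches the reference order at this step."""
    return 1.0 if RECIPE.index(action) == len(plan) else 0.0


def toy_repeat_penalty(plan: Plan, action: str) -> float:
    """Penalize repeating an action already in the plan."""
    return -1.0 if action in plan else 0.0


if __name__ == "__main__":
    best = breadth_first_plan(
        toy_propose, [toy_order_value, toy_repeat_penalty], horizon=4
    )
    print(best)
```

The toy value functions consult a fixed reference recipe only to keep the example deterministic; in the setting the abstract describes, both the proposals and the assessments would come from an LLM.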
