Video procedure planning, i.e., planning a sequence of action steps given the video frames of start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. Despite introducing LLMs, these methods still decode the action steps into a closed set of one-hot vectors, limiting the model's ability to generalize to new steps or tasks. Additionally, fixed action step descriptions based on world-level commonsense may be noisy for specific instances of visual states. In this paper, we propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning. We propose an LLM-Enhanced Planning module that fully exploits the generalization ability of LLMs to produce free-form planning output and to enhance action step decoding. We also propose a Mutual Information Maximization module to connect the world-level commonsense of step descriptions with the sample-specific information of visual states, enabling LLMs to employ their reasoning ability to generate step sequences. With the assistance of LLMs, our method can handle both closed-set and open-vocabulary procedure planning tasks. Our PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs.
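The abstract does not specify how the Mutual Information Maximization module is estimated. A common choice for maximizing mutual information between two embedding sets is the InfoNCE bound, so the following is a minimal sketch under that assumption; the function name, batching scheme, and embedding dimensions are illustrative, not the authors' implementation.

```python
# Sketch: InfoNCE-style mutual-information maximization between
# step-description (commonsense) embeddings and visual-state embeddings.
# All names and shapes are assumptions for illustration.

import torch
import torch.nn.functional as F


def info_nce_mi_loss(visual_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; minimizing it maximizes a lower bound on
    the mutual information between the two embedding sets. Matched
    (visual, text) pairs share the same row index; all other rows in
    the batch serve as negatives.

    visual_feats: (B, D) sample-specific visual state embeddings
    text_feats:   (B, D) step-description embeddings
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature               # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Contrast in both directions: visual-to-text and text-to-visual.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    B, D = 8, 256                                # hypothetical batch / dim
    loss = info_nce_mi_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In this reading, the contrastive objective pulls each visual state toward its matching step description while pushing it away from descriptions of other samples, which is one standard way to ground fixed commonsense text in sample-specific visual evidence.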