Video procedure planning, i.e., planning a sequence of action steps given the video frames of start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. Despite introducing LLMs, these methods still decode the action steps into a closed set of one-hot vectors, limiting the model's ability to generalize to new steps or tasks. Additionally, fixed action step descriptions based on world-level commonsense may be noisy for specific instances of visual states. In this paper, we propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning. We propose an LLM-Enhanced Planning module that fully exploits the generalization ability of LLMs to produce free-form planning output and to enhance action step decoding. We also propose a Mutual Information Maximization module to connect the world-level commonsense of step descriptions with the sample-specific information of visual states, enabling LLMs to employ their reasoning ability to generate step sequences. With the assistance of LLMs, our method can handle both closed-set and open-vocabulary procedure planning tasks. Our PlanLLM achieves superior performance on three benchmarks, demonstrating the effectiveness of our designs.
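The abstract does not specify how the Mutual Information Maximization module is estimated. A common choice for maximizing mutual information between two embedding sets is the InfoNCE bound, so the following is a minimal sketch under that assumption; the function name, batching scheme, and embedding dimensions are illustrative, not the authors' implementation.

```python
# Sketch: InfoNCE-style mutual-information maximization between
# step-description (commonsense) embeddings and visual-state embeddings.
# All names and shapes are assumptions for illustration.

import torch
import torch.nn.functional as F


def info_nce_mi_loss(visual_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; minimizing it maximizes a lower bound on
    the mutual information between the two embedding sets. Matched
    (visual, text) pairs share the same row index; all other rows in
    the batch serve as negatives.

    visual_feats: (B, D) sample-specific visual state embeddings
    text_feats:   (B, D) step-description embeddings
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature               # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Contrast in both directions: visual-to-text and text-to-visual.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    B, D = 8, 256                                # hypothetical batch / dim
    loss = info_nce_mi_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In this reading, the contrastive objective pulls each visual state toward its matching step description while pushing it away from descriptions of other samples, which is one standard way to ground fixed commonsense text in sample-specific visual evidence.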