Video Language Planning

Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson

From arXiv, https://video-language-planning.github.io/

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
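
The tree search described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three model functions below are hypothetical stand-ins (simple arithmetic instead of a vision-language model or a text-to-video model), an integer plays the role of an image frame, and the beam-style pruning is one assumed way to organize the search. All function and parameter names are invented for illustration.

```python
from typing import List, Tuple

Frame = int  # toy stand-in for an image observation
Plan = List[Tuple[str, List[Frame]]]  # (language subgoal, video segment) pairs


def propose_subgoals(task: str, frame: Frame) -> List[str]:
    """Hypothetical stand-in for the VLM policy: candidate next subgoals."""
    return ["add 1", "add 2", "subtract 1"]


def rollout_video(frame: Frame, subgoal: str) -> List[Frame]:
    """Hypothetical stand-in for the text-to-video dynamics model."""
    verb, amount = subgoal.split()
    step = int(amount) if verb == "add" else -int(amount)
    return [frame + step]  # a one-frame "video" segment


def estimate_value(task: str, frame: Frame) -> float:
    """Hypothetical stand-in for the VLM value function (higher is better)."""
    target = int(task.split()[-1])
    return -abs(target - frame)


def plan(task: str, start: Frame, horizon: int, beam_width: int) -> Plan:
    """Beam-style tree search over (subgoal, video segment) branches."""
    beams: List[Tuple[Frame, Plan]] = [(start, [])]
    for _ in range(horizon):
        candidates = []
        for frame, partial in beams:
            for subgoal in propose_subgoals(task, frame):
                video = rollout_video(frame, subgoal)
                candidates.append((video[-1], partial + [(subgoal, video)]))
        # Keep the branches whose final frame the value function rates highest.
        candidates.sort(key=lambda c: estimate_value(task, c[0]), reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


video_plan = plan("reach 5", start=0, horizon=3, beam_width=2)
final_frame = video_plan[-1][1][-1]
```

In this sketch, `horizon` and `beam_width` control how much of the tree is explored, which is one way to picture the abstract's point that a larger computation budget allows a wider search. The returned plan pairs each language subgoal with its generated segment; in the paper's setting, a goal-conditioned policy would then be conditioned on the intermediate frames to produce robot actions.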
