We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm built around a tree search procedure, for which we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models to serve as dynamics models. VLP takes as input a long-horizon task instruction and the current image observation, and outputs a long video plan that gives detailed multimodal (video and language) specifications of how to complete the final task. VLP scales with computation budget: more computation time yields better video plans. It is able to synthesize long-horizon video plans across different robotics domains, from multi-object rearrangement to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
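The tree search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that a vision-language model acts as policy and value function and a text-to-video model acts as dynamics model, so the beam-style pruning, the parameter names (`horizon`, `branching`, `beams`), and the `Plan` container are assumptions made for illustration. The three models are replaced by toy stand-ins over integer "frames" so that the example runs without any learned components.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Frame = int  # toy stand-in for an image observation (assumption)


@dataclass
class Plan:
    """A partial video plan: generated frames plus the language sub-goals used."""
    frames: List[Frame]
    subgoals: List[str] = field(default_factory=list)


def plan_video(
    task: str,
    start: Frame,
    policy: Callable[[str, Frame, int], List[str]],   # VLM policy: proposes sub-goals
    dynamics: Callable[[Frame, str], List[Frame]],    # video model: rolls out frames
    value: Callable[[str, Frame], float],             # VLM value: scores progress
    horizon: int = 3,
    branching: int = 2,
    beams: int = 2,
) -> Plan:
    """Hypothetical beam-style tree search over language sub-goals and video segments."""
    frontier = [Plan(frames=[start])]
    for _ in range(horizon):
        candidates = []
        for plan in frontier:
            last = plan.frames[-1]
            # Expand each partial plan with several proposed sub-goals.
            for subgoal in policy(task, last, branching):
                segment = dynamics(last, subgoal)
                candidates.append(
                    Plan(plan.frames + list(segment), plan.subgoals + [subgoal])
                )
        # Keep only the highest-value partial plans for the next round.
        candidates.sort(key=lambda p: value(task, p.frames[-1]), reverse=True)
        frontier = candidates[:beams]
    return frontier[0]


# Toy stand-ins (assumptions): the "task" is to reach a target integer.
def toy_policy(task: str, frame: Frame, n: int) -> List[str]:
    return ["+1", "-1"][:n]


def toy_dynamics(frame: Frame, subgoal: str) -> List[Frame]:
    return [frame + 1] if subgoal == "+1" else [frame - 1]


def toy_value(task: str, frame: Frame) -> float:
    target = int(task.split()[-1])
    return -abs(target - frame)


best = plan_video("reach 3", 0, toy_policy, toy_dynamics, toy_value)
print(best.frames, best.subgoals)
```

In this sketch, raising `branching` or `beams` spends more computation on exploring alternative plans, which mirrors the stated property that more computation time yields better video plans. The intermediate entries of `best.frames` play the role of the goal images that a goal-conditioned policy would be conditioned on.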