We aim to develop a model-based planning framework for world models that scales with increasing model and data budgets, targeting general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-centric generative Planning (FLIP), a model-based planning algorithm in visual space that features three key modules: (1) a multi-modal flow generation model as the general-purpose action proposal module; (2) a flow-conditioned video generation model as the dynamics module; and (3) a vision-language representation learning model as the value module. Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP improves both the success rates and the quality of long-horizon video plan synthesis and exhibits the interactive world model property, opening up wider applications for future work.
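The abstract's planning loop can be sketched as a search that alternates the three modules: propose candidate flow actions, roll them forward through the dynamics model, and score the resulting states with the value model, keeping the plan that maximizes discounted return. The sketch below is a minimal, hypothetical illustration of that structure; all function names are illustrative assumptions and the three modules are replaced by scalar stand-ins, not the paper's actual models.

```python
# Hypothetical sketch of a FLIP-style model-based planning loop.
# The three learned modules are mocked with scalar stand-ins:
# states/flows are floats here instead of images/image flows.
import random

def propose_flows(state, n=4):
    # stand-in for the multi-modal flow generation model (action proposals)
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def dynamics(state, flow):
    # stand-in for the flow-conditioned video generation model
    return state + flow

def value(state, goal):
    # stand-in for the vision-language value model (higher is better)
    return -abs(goal - state)

def plan(state, goal, horizon=5, gamma=0.99):
    """Greedy search: at each step, keep the proposed flow whose
    predicted next state has the highest value, and accumulate the
    discounted return of the resulting trajectory."""
    plan_flows, ret = [], 0.0
    for t in range(horizon):
        candidates = propose_flows(state)
        best = max(candidates, key=lambda f: value(dynamics(state, f), goal))
        state = dynamics(state, best)
        ret += (gamma ** t) * value(state, goal)
        plan_flows.append(best)
    return plan_flows, state, ret

random.seed(0)
flows, final_state, ret = plan(state=0.0, goal=3.0)
```

A real instantiation would replace the greedy step with the paper's progressive search over flow and video plans, and the scalar state with image observations; the skeleton only shows how the proposal, dynamics, and value modules interact.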