Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate simulation of environments. However, a major issue in such models is generalization: they are limited to synthesizing videos conditioned on language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a world model powerful enough to synthesize plans for unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, on which we condition a set of models to generate videos. We illustrate how this factorization naturally enables compositional generalization by allowing a new natural language instruction to be formulated as a combination of previously seen components. We further show how such a factorization enables us to incorporate additional multimodal goals, allowing us to specify the video we wish to generate given both a natural language instruction and a goal image. Our approach successfully synthesizes video plans for unseen goals on the RT-X dataset, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.
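To make the factorization concrete, below is a minimal sketch of one way per-primitive predictions could be composed at sampling time, in the style of compositional classifier-free guidance. The function name `composed_eps`, the guidance weight `w`, the `model(x_t, t, emb)` interface, and the embedding shapes are illustrative assumptions, not the paper's exact implementation:

```python
def composed_eps(model, x_t, t, primitive_embs, uncond_emb, w=7.5):
    """Compose denoising predictions across parsed language primitives.

    One classifier-free-guidance term per primitive is summed, pushing the
    sample toward satisfying every primitive at once. This is a sketch of
    the compositional idea, not the paper's exact sampler.
    """
    eps_uncond = model(x_t, t, uncond_emb)  # unconditional prediction
    eps = eps_uncond
    for emb in primitive_embs:
        # Each primitive (e.g. an action phrase or object phrase parsed
        # from the instruction) contributes its own guidance term.
        eps = eps + w * (model(x_t, t, emb) - eps_uncond)
    return eps


# Toy check with a dummy "model" so the sketch runs end to end.
if __name__ == "__main__":
    import torch

    dummy = lambda x, t, c: x * 0 + c.mean()   # stand-in denoiser
    x = torch.randn(1, 3, 8, 32, 32)           # (batch, channels, frames, H, W)
    prims = [torch.randn(16), torch.randn(16)] # e.g. parsed action + object phrases
    print(composed_eps(dummy, x, 0, prims, torch.zeros(16)).shape)
```

At each denoising step, such a composed prediction replaces the single-prompt one, so an instruction whose primitives were each seen during training can in principle be handled even when their combination never was.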