Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate simulation of environments. However, a major issue in such models is generalization: they are limited to synthesizing videos conditioned on language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a world model powerful enough to synthesize plans for unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, on which we condition a set of models to generate videos. We illustrate how this factorization naturally enables compositional generalization by allowing a new natural language instruction to be formulated as a combination of previously seen components. We further show how such a factorization enables us to incorporate additional multimodal goals, allowing us to specify the video we wish to generate given both a natural language instruction and a goal image. Our approach successfully synthesizes video plans for unseen goals on the RT-X dataset, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.
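To make the factorization concrete, below is a minimal sketch of one way per-primitive predictions could be composed at sampling time, in the style of compositional classifier-free guidance. The function name `composed_eps`, the guidance weight `w`, the `model(x_t, t, emb)` interface, and the embedding shapes are illustrative assumptions, not the paper's exact implementation:

```python
def composed_eps(model, x_t, t, primitive_embs, uncond_emb, w=7.5):
    """Compose denoising predictions across parsed language primitives.

    One classifier-free-guidance term per primitive is summed, pushing the
    sample toward satisfying every primitive at once. This is a sketch of
    the compositional idea, not the paper's exact sampler.
    """
    eps_uncond = model(x_t, t, uncond_emb)  # unconditional prediction
    eps = eps_uncond
    for emb in primitive_embs:
        # Each primitive (e.g. an action phrase or object phrase parsed
        # from the instruction) contributes its own guidance term.
        eps = eps + w * (model(x_t, t, emb) - eps_uncond)
    return eps


# Toy check with a dummy "model" so the sketch runs end to end.
if __name__ == "__main__":
    import torch

    dummy = lambda x, t, c: x * 0 + c.mean()   # stand-in denoiser
    x = torch.randn(1, 3, 8, 32, 32)           # (batch, channels, frames, H, W)
    prims = [torch.randn(16), torch.randn(16)] # e.g. parsed action + object phrases
    print(composed_eps(dummy, x, 0, prims, torch.zeros(16)).shape)
```

At each denoising step, such a composed prediction replaces the single-prompt one, so an instruction whose primitives were each seen during training can in principle be handled even when their combination never was.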