Constructing simulation scenes that are both visually and physically realistic is a problem of practical interest in domains ranging from robotics to computer vision. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand: a graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, achieving the generalization required for data-driven robotic control calls for a pipeline that can synthesize large numbers of realistic scenes, complete with 'natural' kinematic and dynamic structures. To attack this problem, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used to generate paired training data, which allows us to model the inverse problem of mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes, complete with articulated kinematic and dynamic structures, from real-world images, and we use these scenes to train robotic control policies. We then deploy these policies robustly in the real world on tasks such as articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.