Learning Interactive Real-World Simulators

Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies to training embodied agents purely in simulation for direct deployment in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first observe that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as "open the drawer" and low-level controls from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed zero-shot in the real world after training purely in simulation. We also show that other types of intelligence, such as video captioning models, can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.
