Pre-trained text-to-image generative models can produce diverse, semantically rich, and realistic images from natural language descriptions. Compared with language, images usually convey information in more detail and with less ambiguity. In this study, we propose Learning from the Void (LfVoid), a method that leverages pre-trained text-to-image models and advanced image editing techniques to guide robot learning. Given a natural language instruction, LfVoid edits the original observations to obtain goal images, for example by "wiping" a stain off a table. LfVoid then trains an ensembled goal discriminator on the generated images to provide reward signals for a reinforcement learning agent, guiding it toward the goal. LfVoid learns without any in-domain training on expert demonstrations or true goal observations (the "void"), relying instead on knowledge from web-scale generative models. We evaluate LfVoid on three simulated tasks and validate its feasibility in the corresponding real-world scenarios. We also offer insights into the key considerations for effectively integrating visual generative models into robot learning workflows. We posit that our work is an initial step toward the broader application of pre-trained visual generative models in robotics. Project page: https://lfvoid-rl.github.io/.
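The reward step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the goal images and the agent's observations have already been reduced to small feature vectors, uses logistic regression as a stand-in for each ensemble member, treats the agent's own observations as negatives, and takes the reward to be the mean predicted goal probability. All names (`GoalDiscriminator`, `train_ensemble`, `reward`, `make_synthetic_features`) are hypothetical, and the features are synthetic.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class GoalDiscriminator:
    """Hypothetical ensemble member: logistic regression on feature vectors."""

    def __init__(self, dim, rng):
        self.w = [rng.gauss(0.0, 0.01) for _ in range(dim)]
        self.b = 0.0

    def predict(self, x):
        # Probability that feature vector x depicts the goal.
        return _sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def fit(self, positives, negatives, rng, epochs=50, lr=0.5):
        data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
        for _ in range(epochs):
            rng.shuffle(data)
            for x, y in data:
                err = self.predict(x) - y  # log-loss gradient w.r.t. the logit
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= lr * err


def train_ensemble(goal_feats, obs_feats, n_members=5, seed=0):
    # Assumption: each member sees a bootstrap resample of both classes.
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        member = GoalDiscriminator(len(goal_feats[0]), rng)
        pos = [rng.choice(goal_feats) for _ in goal_feats]
        neg = [rng.choice(obs_feats) for _ in obs_feats]
        member.fit(pos, neg, rng)
        members.append(member)
    return members


def reward(members, obs_feat):
    # Assumption: reward is the mean goal probability across the ensemble.
    return sum(m.predict(obs_feat) for m in members) / len(members)


def make_synthetic_features(n=20, seed=1):
    # Synthetic stand-ins: "goal" features cluster near (1, 1),
    # ordinary observations cluster near (0, 0).
    rng = random.Random(seed)
    goal = [[1 + rng.gauss(0, 0.1), 1 + rng.gauss(0, 0.1)] for _ in range(n)]
    obs = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(n)]
    return goal, obs


goal_feats, obs_feats = make_synthetic_features()
ensemble = train_ensemble(goal_feats, obs_feats)
print(reward(ensemble, [1.0, 1.0]), reward(ensemble, [0.0, 0.0]))
```

In a reinforcement learning loop, `reward(ensemble, features_of(observation))` would replace a hand-designed reward, where `features_of` is a hypothetical image encoder. Averaging over several members is one common way to reduce the variance of a learned reward.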