Learning generalist embodied agents that can solve a multitude of tasks across different domains is a long-standing problem. Reinforcement learning (RL) is hard to scale up because it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional in embodied domains, due to the significant domain gap. However, the lack of multimodal data in such domains is an obstacle to developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, which connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learn the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.