In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To plan effectively in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on the actions of an arbitrary number of agents, given only partial egocentric visual observations of the world. To address this partial observability, we first train generative models to estimate the overall world state from partial egocentric observations. To accurately simulate multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation that factorizes the naturally composable joint actions of multiple agents and compositionally generates video conditioned on the world state. Leveraging this compositional world model, together with Vision Language Models that infer the actions of other agents, we use a tree search procedure to integrate these modules and enable online cooperative planning. We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show that our compositional world model is effective and that the framework enables embodied agents to cooperate efficiently with different partners across various tasks and with an arbitrary number of agents, demonstrating the promise of our approach. More videos can be found at https://embodied-agi.cs.umass.edu/combo/.
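To make the planning loop concrete, the sketch below shows how a tree search can compose per-agent candidate actions into joint actions, roll them forward with a world model, and pick the best first joint action. This is a minimal illustrative sketch, not the paper's implementation: the function names (`plan`, `simulate`, `value`) and the greedy beam-style expansion are assumptions introduced here for clarity.

```python
from itertools import product

def plan(state, candidate_actions, simulate, value, depth=2, beam=3):
    """Greedy beam-style tree search over composed multi-agent actions.

    state             -- current estimated world state
    candidate_actions -- one list of candidate actions per agent
    simulate(s, a)    -- world model: next state after joint action a
    value(s)          -- heuristic score of a state
    """
    frontier = [(state, None)]  # (state, first joint action taken)
    for _ in range(depth):
        expanded = []
        for s, first in frontier:
            # Compose the naturally composable per-agent actions
            # into joint actions via a Cartesian product.
            for joint in product(*candidate_actions):
                nxt = simulate(s, joint)
                expanded.append((nxt, first if first is not None else joint))
        # Keep only the highest-value nodes for the next expansion.
        expanded.sort(key=lambda t: value(t[0]), reverse=True)
        frontier = expanded[:beam]
    return frontier[0][1]  # best first joint action
```

In the full framework, `simulate` would be the learned compositional video world model and `value` could be informed by a Vision Language Model; here any callables with these signatures work, which keeps the search logic independent of the underlying models.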