Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize this ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that lets two agents with partial observability communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a measurement framework of three metrics defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts by 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.
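To make the first two metrics concrete, the sketch below shows one way they could be computed. It is illustrative only and is not the paper's definition: it assumes each per-agent world graph can be represented as a set of (subject, relation, object) triples, uses Jaccard overlap as a stand-in for observation convergence, and uses the fraction of message facts absent from the receiver's graph as a stand-in for information novelty. The names `Fact`, `observation_convergence`, and `information_novelty` are hypothetical.

```python
from typing import Set, Tuple

# Assumption: a world graph is a set of (subject, relation, object) triples.
Fact = Tuple[str, str, str]


def observation_convergence(graph_a: Set[Fact], graph_b: Set[Fact]) -> float:
    """Jaccard overlap of two world graphs (illustrative proxy, not the paper's metric)."""
    union = graph_a | graph_b
    if not union:
        return 1.0  # two empty graphs are trivially identical
    return len(graph_a & graph_b) / len(union)


def information_novelty(message_facts: Set[Fact], receiver_graph: Set[Fact]) -> float:
    """Fraction of facts in a message that the receiver does not already hold."""
    if not message_facts:
        return 0.0  # an empty message conveys nothing new
    return len(message_facts - receiver_graph) / len(message_facts)


# Hypothetical private world graphs for two agents.
agent_a = {("cup", "on", "table"), ("plate", "in", "sink")}
agent_b = {("cup", "on", "table"), ("book", "on", "shelf")}

# Agent A tells agent B about both facts it knows.
message = {("plate", "in", "sink"), ("cup", "on", "table")}

before = observation_convergence(agent_a, agent_b)
novelty = information_novelty(message, agent_b)
agent_b_after = agent_b | message  # the receiver merges the message into its graph
after = observation_convergence(agent_a, agent_b_after)
```

Under these assumptions, a message raises convergence only to the extent that it is novel to the receiver; tracking the convergence score across timesteps gives a trajectory rather than a single number. Belief-sensitive messaging would additionally need the sender's estimate of the partner's graph, which this sketch does not model.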
