In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent settings, their application to MARL remains limited by an inability to handle teammate-induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer-style recurrent state-space model (RSSM) into environment and teammate components, and learns an auxiliary Theory-of-Mind (ToM) head to infer latent embeddings of partner behavior, such as character, intent, and predicted actions, from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero-shot and few-shot coordination in partially observable settings, and we propose a set of benchmarks and evaluation protocols to assess its impact. This work positions world models not only as predictors of environmental dynamics but also as simulators of social behavior, opening new directions for generalizable, human-compatible AI.
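The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: all dimensions, the single-layer networks, the mean-pooled ToM head, the deterministic latent update standing in for the stochastic RSSM, and the assumption that teammate actions are observable are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
OBS_DIM, ACT_DIM = 8, 4      # ego observation size, number of discrete actions
ENV_DIM, MATE_DIM = 16, 6    # environment latent size, teammate latent size


def init_linear(in_dim, out_dim):
    """Return (weights, bias) for a small randomly initialised linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def linear(params, x):
    w, b = params
    return x @ w + b


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical parameters for each component of the factorized model.
env_step = init_linear(ENV_DIM + OBS_DIM + ACT_DIM, ENV_DIM)  # environment latent update
tom_head = init_linear(OBS_DIM + ACT_DIM, MATE_DIM)           # ToM head features
actor = init_linear(ENV_DIM + MATE_DIM, ACT_DIM)              # actor on [env, teammate]


def infer_teammate_latent(obs_seq, mate_act_seq):
    """ToM head: mean-pool per-step features of a partial trajectory into one embedding."""
    steps = np.concatenate([obs_seq, mate_act_seq], axis=-1)
    return np.tanh(linear(tom_head, steps)).mean(axis=0)


def update_env_latent(env_latent, obs, ego_act):
    """Deterministic stand-in for the RSSM update of the environment component."""
    return np.tanh(linear(env_step, np.concatenate([env_latent, obs, ego_act])))


def act(env_latent, mate_latent):
    """Actor: action distribution conditioned on the factorized latent state."""
    return softmax(linear(actor, np.concatenate([env_latent, mate_latent])))


# Usage: observe a 5-step partial trajectory, then compute an action distribution.
T = 5
obs_seq = rng.normal(size=(T, OBS_DIM))
mate_act_seq = np.eye(ACT_DIM)[rng.integers(0, ACT_DIM, size=T)]  # one-hot teammate actions
ego_act_seq = np.eye(ACT_DIM)[rng.integers(0, ACT_DIM, size=T)]   # one-hot ego actions

env_latent = np.zeros(ENV_DIM)
for t in range(T):
    env_latent = update_env_latent(env_latent, obs_seq[t], ego_act_seq[t])

mate_latent = infer_teammate_latent(obs_seq, mate_act_seq)
action_probs = act(env_latent, mate_latent)
```

Keeping the teammate embedding separate from the environment latent means a different partner changes only `mate_latent`, so the same actor weights can, in principle, respond to a new collaborator without retraining; the critic would be conditioned on the same concatenated latent.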