Learning to coordinate many agents in partially observable, highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imagined roll-outs. We design a world model trained with variational auto-encoders and augment it with a state-action learned embedding (SALE). SALE is injected into both the imagination module, which forecasts plausible future roll-outs, and the joint agent network, whose individual action values are combined through a mixing network to estimate the joint action-value function. By coupling imagined trajectories with SALE-based action values, agents acquire a richer understanding of how their choices influence collective outcomes, improving long-term planning and optimization under limited real-environment interaction. Empirical studies on well-established multi-agent benchmarks, including StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging, demonstrate consistent gains over baseline algorithms and highlight the effectiveness of joint state-action learned embeddings within a model-based multi-agent paradigm.
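To make the architecture concrete, the following is a minimal numpy sketch of the two components the abstract names: a SALE-style state-action encoder (a state embedding followed by a joint state-action embedding) and a QMIX-style monotonic mixing network that combines per-agent action values into a joint value, conditioned on the SALE embedding. All layer sizes, class names, and the choice of a QMIX-style mixer with absolute-valued weights are illustrative assumptions, not the paper's exact implementation; no training is shown, only the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Dense-layer parameters (random init; this sketch does no training).
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def forward(x, params):
    W, b = params
    return x @ W + b

def relu(x):
    return np.maximum(x, 0.0)

class SALEEmbedding:
    """State-action learned embedding: z_s = f(s), z_sa = g([z_s, a])."""
    def __init__(self, state_dim, action_dim, embed_dim=16):
        self.f = linear(state_dim, embed_dim)
        self.g = linear(embed_dim + action_dim, embed_dim)

    def __call__(self, state, joint_action):
        z_s = relu(forward(state, self.f))
        return relu(forward(np.concatenate([z_s, joint_action]), self.g))

class MonotonicMixer:
    """QMIX-style mixer: joint Q from per-agent utilities, conditioned on z_sa."""
    def __init__(self, n_agents, cond_dim, hidden=8):
        self.n_agents, self.hidden = n_agents, hidden
        self.w1 = linear(cond_dim, n_agents * hidden)
        self.b1 = linear(cond_dim, hidden)
        self.w2 = linear(cond_dim, hidden)

    def __call__(self, agent_qs, cond):
        # Absolute values keep the mixing weights nonnegative, so the joint
        # value is monotone in every agent's individual action value.
        W1 = np.abs(forward(cond, self.w1)).reshape(self.n_agents, self.hidden)
        W2 = np.abs(forward(cond, self.w2))
        h = relu(agent_qs @ W1 + forward(cond, self.b1))
        return float(h @ W2)

# Forward pass: embed a (state, joint action) pair, then mix per-agent values.
state = rng.normal(size=4)
joint_action = rng.normal(size=3)
sale = SALEEmbedding(state_dim=4, action_dim=3)
z_sa = sale(state, joint_action)

mixer = MonotonicMixer(n_agents=2, cond_dim=16)
agent_qs = np.array([0.5, -0.2])
q_tot = mixer(agent_qs, z_sa)
q_tot_up = mixer(agent_qs + np.array([1.0, 0.0]), z_sa)
```

Because the mixing weights are nonnegative, raising any single agent's utility can never lower the joint value (`q_tot_up >= q_tot`), which is the property that lets decentralized per-agent greedy action selection remain consistent with the joint action-value function.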