In recent years, model-based reinforcement learning (MBRL) has emerged as a way to address the high sample complexity of multi-agent reinforcement learning (MARL) by modeling agent-environment dynamics to improve sample efficiency. However, most MBRL methods assume that every agent receives complete and continuous observations at inference time, an assumption that is often unrealistic in practice. We introduce RMIO, a novel model-based MARL approach designed specifically for scenarios in which some agents lose their observations. RMIO leverages the world model to reconstruct missing observations, and further reduces reconstruction error through inter-agent information integration, ensuring stable multi-agent decision-making. Moreover, unlike CTCE methods such as MAMBA, RMIO adopts the CTDE paradigm in standard settings and enables limited communication only when agents lack observation data, thereby reducing reliance on communication. Additionally, RMIO improves asymptotic performance through reward smoothing, a dual-layer experience replay buffer, and an RNN-augmented policy model, surpassing prior work. Experiments in both the SMAC and MaMuJoCo environments demonstrate that RMIO outperforms current state-of-the-art approaches in asymptotic convergence performance and policy robustness, both in standard task settings and in scenarios involving observation loss.