Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, an important step toward deploying multi-agent systems in real-world applications. In practice, however, the individual behavior policies that generate the multi-agent joint trajectories usually differ in performance; for example, one agent may follow a random policy while the other agents follow medium policies. In a cooperative game with a global reward, an agent trained by existing offline MARL methods often inherits this random policy, jeopardizing the performance of the entire team. In this paper, we investigate offline MARL with explicit consideration of the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline datasets into prioritized experience replay with individual trajectories, so that agents can share their good trajectories and conservatively train their policies with a graph attention network (GAT) based critic. We evaluate our method in both discrete control (i.e., StarCraft II and the multi-agent particle environment) and continuous control (i.e., multi-agent MuJoCo). The results indicate that our method achieves significantly better results on complex and mixed offline multi-agent datasets, especially when the difference in data quality between individual trajectories is large.
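To make the data-reconstruction step concrete, the following is a minimal sketch of how joint transitions might be split into individual transitions and pooled into a shared, prioritized buffer. It is an illustration under stated assumptions, not the SIT implementation: the per-agent credits are taken as given (in the framework they would come from the reward decomposition network), and the names `IndividualTransition`, `split_joint_transition`, and `SharedPrioritizedBuffer`, as well as the shifted-credit priority rule, are hypothetical choices made here for clarity.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class IndividualTransition:
    agent_id: int          # which agent generated this transition
    obs: Sequence[float]   # that agent's local observation
    action: int            # that agent's action
    credit: float          # per-agent credit, assumed to be supplied externally


def split_joint_transition(obs_n, actions_n, credits_n) -> List[IndividualTransition]:
    """Split one joint transition into one individual transition per agent."""
    return [
        IndividualTransition(agent_id, obs, action, credit)
        for agent_id, (obs, action, credit) in enumerate(
            zip(obs_n, actions_n, credits_n)
        )
    ]


class SharedPrioritizedBuffer:
    """Pool of individual transitions shared by all agents.

    Assumption: priority is the credit shifted to be strictly positive.
    This rule is illustrative only and is not taken from the paper.
    """

    def __init__(self, eps: float = 1e-3):
        self.items: List[IndividualTransition] = []
        self.eps = eps

    def add(self, transitions: List[IndividualTransition]) -> None:
        self.items.extend(transitions)

    def priorities(self) -> List[float]:
        lowest = min(t.credit for t in self.items)
        return [t.credit - lowest + self.eps for t in self.items]

    def sample(self, batch_size: int, rng: random.Random) -> List[IndividualTransition]:
        # Sample with replacement, proportionally to priority.
        return rng.choices(self.items, weights=self.priorities(), k=batch_size)


if __name__ == "__main__":
    # Toy joint transition: agent 0 behaves randomly (low credit),
    # agents 1 and 2 behave at a medium level (higher credit).
    buffer = SharedPrioritizedBuffer()
    buffer.add(split_joint_transition([[0.0], [1.0], [2.0]], [0, 1, 2], [0.0, 1.0, 1.0]))
    batch = buffer.sample(8, random.Random(0))
    print([t.agent_id for t in batch])
```

Because every agent draws from the same pool, transitions from the low-credit agent are rarely sampled, while the higher-credit transitions of its teammates become available to all agents. This mirrors the sharing idea described in the abstract only at a schematic level; the conservative policy training and the GAT-based critic are outside the scope of this sketch.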