Value-decomposition methods reduce the difficulty of a multi-agent system by decomposing the joint state-action space into local observation-action spaces, and they have become popular in cooperative multi-agent reinforcement learning (MARL). However, these methods still consume a very large number of training samples and lack active exploration. In this paper, we propose scalable value-decomposition exploration (SVDE), a method that combines a scalable training mechanism, an intrinsic reward design, and explorative experience replay. The scalable training mechanism asynchronously decouples strategy learning from environmental interaction, which accelerates sample generation in a MapReduce manner. To address the lack of exploration, the intrinsic reward design encourages exploration that produces diverse samples, and the explorative experience replay filters out non-novel samples. Empirically, our method achieves the best performance on almost all maps compared with other popular algorithms on a set of StarCraft II micromanagement games. A data-efficiency experiment further shows that SVDE accelerates sample collection and policy convergence, and a set of ablation experiments demonstrates the effectiveness of each component of SVDE.
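As a rough illustration of how an intrinsic reward and a novelty filter on replayed experience can work together, the following minimal Python sketch may help. It is not the SVDE implementation: the abstract does not say how novelty is measured, so the sketch assumes a simple count-based bonus of 1/sqrt(N(o)) over hashable observations. The class name `NoveltyFilteredReplay` and the parameters `beta` and `min_novelty` are hypothetical names introduced only for this example.

```python
import math
from collections import Counter, deque


class NoveltyFilteredReplay:
    """Hypothetical replay buffer (not the SVDE code).

    Assumption: novelty is count-based, 1 / sqrt(visit count of next_obs).
    """

    def __init__(self, capacity=1000, beta=1.0, min_novelty=0.45):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted first
        self.visits = Counter()               # visit count per observation
        self.beta = beta                      # weight of the intrinsic bonus
        self.min_novelty = min_novelty        # samples below this are dropped

    def novelty(self, obs):
        # Bonus shrinks as the same observation is visited more often.
        return 1.0 / math.sqrt(self.visits[obs])

    def add(self, obs, action, extrinsic_reward, next_obs):
        """Store the transition if it is still novel; return True if stored."""
        self.visits[next_obs] += 1
        bonus = self.novelty(next_obs)
        if bonus < self.min_novelty:
            return False  # filter out the non-novel sample
        # Shaped reward = environment reward + weighted intrinsic bonus.
        shaped_reward = extrinsic_reward + self.beta * bonus
        self.buffer.append((obs, action, shaped_reward, next_obs))
        return True


# Usage: the same transition is offered six times in a row.
replay = NoveltyFilteredReplay()
kept = [replay.add((0,), 1, 0.0, (1,)) for _ in range(6)]
print(kept)  # [True, True, True, True, False, False]
```

In this sketch the first four repeats are stored with a shrinking bonus (1.0, then about 0.71, 0.58, 0.5), and the fifth and sixth are dropped because their novelty falls below the assumed threshold of 0.45. A learned novelty estimate could replace the visit counter without changing the surrounding logic.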