In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic because each agent has only partial observability and the policies of the other agents change continuously. To address this stochasticity, we propose DFAC, a unified framework for integrating distributional RL with value function factorization methods. The framework generalizes expected value function factorization methods so that they can factorize return distributions. To validate DFAC, we first demonstrate its ability to factorize the value functions of a simple matrix game with stochastic rewards. We then perform experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and on six self-designed Ultra Hard maps, showing that DFAC outperforms a number of baselines.