Value function factorization via centralized training and decentralized execution is promising for solving cooperative multi-agent reinforcement learning tasks. One of the approaches in this area, QMIX, has become state-of-the-art and achieved the best performance on the StarCraft II micromanagement benchmark. However, the monotonic mixing of per-agent estimates in QMIX is known to restrict the joint-action Q-values it can represent, and individual agents have insufficient global state information for estimating their own value functions; both limitations often result in suboptimality. To this end, we present LSF-SAC, a novel framework that features a variational inference-based information-sharing mechanism, which provides extra state information to assist individual agents in the value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the power of value function factorization, while fully decentralized execution can still be maintained in LSF-SAC through a soft actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods in challenging collaborative tasks. We further conduct extensive ablation studies to locate the key factors accounting for its performance improvements. We believe that this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and implementation code can be found at https://sites.google.com/view/sacmm.
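As a rough illustration of the monotonic mixing constraint mentioned in the abstract, the sketch below combines per-agent Q-values with non-negative weights and checks that each agent's greedy local choice matches the best joint action. It is not the QMIX or LSF-SAC implementation: the names `monotonic_mix`, `greedy_joint_action`, and `best_joint_action` are hypothetical, the numbers are made up, and a single linear layer with fixed weights is assumed for brevity, whereas QMIX generates its mixing weights from the global state with hypernetworks.

```python
from itertools import product


def monotonic_mix(agent_qs, weights, bias=0.0):
    """Mix per-agent Q-values into a joint value Q_tot (simplified, single layer)."""
    # abs() keeps every mixing weight non-negative, so Q_tot never decreases
    # when any single agent's Q-value increases.
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def greedy_joint_action(q_tables):
    """Decentralized execution: each agent maximizes only its own Q-values."""
    return tuple(max(range(len(q)), key=lambda a: q[a]) for q in q_tables)


def best_joint_action(q_tables, weights, bias=0.0):
    """Centralized check: brute-force search over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in q_tables))
    return max(
        joint_actions,
        key=lambda joint: monotonic_mix(
            [q[a] for q, a in zip(q_tables, joint)], weights, bias
        ),
    )


# Illustrative values only: two agents with 3 and 2 actions respectively.
q_tables = [[1.0, 3.0, 2.0], [0.5, -1.0]]
weights = [-2.0, 0.7]  # raw weights; abs() is applied inside the mixer

print(greedy_joint_action(q_tables))                   # (1, 0)
print(best_joint_action(q_tables, weights, bias=0.1))  # (1, 0)
```

Because the mixed value is non-decreasing in every agent's Q-value, the per-agent argmax agrees with the joint argmax, which is what allows decentralized execution. The same property is the restriction the abstract refers to: under this assumed form, a joint value in which one agent's best action depends on what another agent does cannot be represented exactly.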