Decentralized policy optimization is commonly used in cooperative multi-agent tasks. However, because all agents update their policies simultaneously, the environment is non-stationary from the perspective of each individual agent, which makes it hard to guarantee monotonic policy improvement. To make policy improvement stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help construct the transition and reward functions from an individual perspective. Our theoretical analysis shows that policy optimization in MDPO is more stable than in model-free decentralized policy optimization. However, because of non-stationarity, the latent variable function itself varies and is hard to model. We therefore propose a latent variable prediction method to reduce the error of the latent variable function, which in theory contributes to monotonic policy improvement. Empirically, MDPO outperforms model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.
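The non-stationarity described in the abstract can be illustrated with a minimal sketch. It does not implement MDPO or its latent variable function; it only assumes a hypothetical two-agent cooperative matrix game (`SHARED_REWARD`, the policies, and `individual_reward` are illustrative names and values, not taken from the paper) and shows that the reward function one agent perceives changes when the other agent updates its policy.

```python
# Hypothetical two-agent cooperative matrix game (an assumption, not from the paper).
# SHARED_REWARD[a_i][a_j] is the team reward when agent i plays a_i and agent j plays a_j.
SHARED_REWARD = [
    [1.0, 0.0],
    [0.0, 1.0],
]


def individual_reward(a_i, policy_j):
    """Expected reward agent i perceives for action a_i, averaged over agent j's policy."""
    return sum(p * SHARED_REWARD[a_i][a_j] for a_j, p in enumerate(policy_j))


# Agent j's policy before and after one of its own updates (assumed values).
policy_j_before = [0.9, 0.1]
policy_j_after = [0.2, 0.8]

rewards_before = [individual_reward(a, policy_j_before) for a in range(2)]
rewards_after = [individual_reward(a, policy_j_after) for a in range(2)]

# Agent i's best action flips although the underlying game is unchanged.
best_before = max(range(2), key=lambda a: rewards_before[a])
best_after = max(range(2), key=lambda a: rewards_after[a])

print("before:", rewards_before, "best action:", best_before)
print("after: ", rewards_after, "best action:", best_after)
```

From agent i's point of view the reward attached to each action has shifted purely because agent j changed its policy, which is the effect an individual-perspective model has to account for.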