We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, IPPO and MAPPO, both of which rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises from enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and of tuning the hyperparameters with regard to the number of agents, as predicted by our theoretical analysis.
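To make the idea of clipping independent ratios concrete, the following is a minimal sketch, not the authors' implementation. It computes one probability ratio per agent and applies a PPO-style clipped surrogate to each. The function names, the shared-advantage layout, and in particular the rule `per_agent_clip_range` (which shrinks the per-agent clip range so that the product of the upper clip limits equals `1 + joint_eps`) are illustrative assumptions; the abstract states only that the bound depends on the number of agents, not its exact form.

```python
import numpy as np


def per_agent_clip_range(joint_eps: float, n_agents: int) -> float:
    """Hypothetical rule (an assumption, not taken from the paper):
    choose eps so that (1 + eps) ** n_agents == 1 + joint_eps."""
    return (1.0 + joint_eps) ** (1.0 / n_agents) - 1.0


def independent_ratios(new_logp, old_logp):
    """Per-agent ratios pi_new(a_i | o_i) / pi_old(a_i | o_i).

    Both inputs are log-probabilities of shape (batch, n_agents).
    """
    return np.exp(np.asarray(new_logp) - np.asarray(old_logp))


def clipped_surrogate(new_logp, old_logp, advantages, joint_eps=0.2):
    """PPO-style clipped objective applied to each agent's own ratio."""
    ratios = independent_ratios(new_logp, old_logp)
    n_agents = ratios.shape[1]
    eps = per_agent_clip_range(joint_eps, n_agents)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    # Assumed cooperative setting: one shared advantage per sample,
    # broadcast to every agent.
    adv = np.asarray(advantages, dtype=float)[:, None]
    return float(np.minimum(ratios * adv, clipped * adv).mean())
```

Under this assumed rule the per-agent clip range narrows as the number of agents grows, which matches the abstract's qualitative claim that the clipping hyperparameter should be tuned with regard to the number of agents.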