Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents' actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handling missing data is importance sampling, in which old data collected under a base policy is reweighted to estimate gradients for the current policy. However, importance sampling quickly becomes unstable when communication is severely limited (i.e., the probability of missing data is high), since the base policy grows increasingly outdated. To address this issue, we propose a technique called base policy prediction, which uses old gradients to predict future policy updates and collects samples for a sequence of predicted base policies, thereby reducing the gap between the base policy and the current policy. This approach enables effective learning with significantly fewer communication rounds, since the samples for the predicted base policies can all be collected within a single communication round. Theoretically, we show that our algorithm converges to an $\varepsilon$-Nash equilibrium in potential games with only $O(\varepsilon^{-3/4})$ communication rounds and $O(\mathrm{poly}(\max_i |A_i|)\,\varepsilon^{-11/4})$ samples, improving on state-of-the-art results in communication cost and achieving a sample complexity free of the exponential dependence on the joint action space size. We also extend these results to general Markov Cooperative Games to find an agent-wise local maximum. Empirically, we evaluate the base policy prediction algorithm both in simulated games and within MAPPO on complex environments.
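As context for the reweighting step mentioned above, the following is a minimal sketch of importance-sampling gradient reweighting on a toy bandit-style problem. All names and the softmax-policy setup are illustrative assumptions, not the paper's algorithm; the point is only that samples drawn under an outdated base policy can be reweighted by the likelihood ratio to estimate a quantity under the current policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Hypothetical setup: a single-state policy; the "base" policy is outdated,
# the "current" policy is a small perturbation of it.
base_logits = rng.normal(size=n_actions)
cur_logits = base_logits + 0.5 * rng.normal(size=n_actions)
pi_base = softmax(base_logits)
pi_cur = softmax(cur_logits)

# Samples were collected under the (outdated) base policy.
actions = rng.choice(n_actions, size=10_000, p=pi_base)
rewards = rng.normal(loc=actions, scale=1.0)  # toy reward: E[r | a] = a

# Importance weights correct for the mismatch between policies.
weights = pi_cur[actions] / pi_base[actions]

# Reweighted estimate of the current policy's expected reward,
# compared against the exact value computed from pi_cur.
est = np.mean(weights * rewards)
true_val = np.sum(pi_cur * np.arange(n_actions))
```

The variance of `weights` grows as `pi_cur` drifts away from `pi_base`, which is exactly the instability the abstract attributes to an outdated base policy; base policy prediction shrinks that drift before reweighting is applied.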