We extend trust region policy optimization (TRPO) to multi-agent reinforcement learning (MARL) problems. We show that the TRPO policy update can be transformed into a distributed consensus optimization problem in the multi-agent setting. By making a series of approximations to the consensus optimization model, we propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). The algorithm optimizes distributed policies based on local observations and private rewards. Agents do not need to know the observations, rewards, policies, or value/action-value functions of other agents; they share only a likelihood ratio with their neighbors during training. The algorithm is fully decentralized and privacy-preserving. Experiments on two cooperative games demonstrate its robust performance on complicated MARL tasks.
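To make the communication pattern concrete, the following is a minimal sketch, not the MATRPO update itself. It assumes each agent can evaluate the log-probability of its own sampled actions under its old and new local policies, forms the per-sample likelihood ratio from them, and then runs a generic neighbor-averaging consensus iteration on a ring graph. The function names, the ring topology, the averaging weights, and the synthetic log-probabilities are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def likelihood_ratio(new_logp, old_logp):
    """Per-sample ratio pi_new(a|o) / pi_old(a|o), computed from log-probabilities."""
    return np.exp(new_logp - old_logp)


def consensus_step(values, adjacency):
    """One averaging round: each agent replaces its value with the mean of
    its own value and its neighbors' values (hypothetical weighting)."""
    n = adjacency.shape[0]
    weights = adjacency + np.eye(n)  # include the agent itself
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values


# Synthetic data (assumption): 4 agents, 5 shared samples each.
rng = np.random.default_rng(0)
n_agents, n_samples = 4, 5
old_logp = rng.normal(-1.0, 0.1, size=(n_agents, n_samples))
new_logp = old_logp + rng.normal(0.0, 0.05, size=(n_agents, n_samples))

# Each row is one agent's locally computed likelihood ratio.
ratios = likelihood_ratio(new_logp, old_logp)

# Ring graph (assumption): agent i talks only to agents i-1 and i+1.
adjacency = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    adjacency[i, (i + 1) % n_agents] = 1.0
    adjacency[i, (i - 1) % n_agents] = 1.0

# Repeated local averaging; only ratio values cross agent boundaries.
estimates = ratios.copy()
for _ in range(50):
    estimates = consensus_step(estimates, adjacency)
```

On this regular graph the averaging weights are doubly stochastic, so every agent's estimate approaches the network-wide mean of the local ratios. The point of the sketch is only that the exchanged quantity is a ratio, never an observation, reward, or policy parameter; the actual algorithm couples this exchange with a trust-region constrained policy optimization that is not reproduced here.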