In this paper, we propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL). Existing MARL algorithms often assume that every agent can observe the states and actions of all other agents in the network. This assumption can be impractical in large-scale problems, where sharing state and action information with multi-hop neighbors may incur significant communication overhead. The advantage of the proposed method is that each agent can compute the local policy gradient needed to update its local policy function from a local estimate of the global accumulated reward; these estimates depend only on partial state and action information and can be obtained using consensus. Specifically, to calculate the local policy gradients, we develop a new distributed zeroth-order policy gradient estimator that relies on one-point residual feedback. Compared to existing zeroth-order estimators that also rely on one-point feedback, this estimator significantly reduces the variance of the policy gradient estimates and thereby improves learning performance. We show that the proposed method with a constant stepsize converges to a neighborhood of a policy that is a stationary point of the global objective function. The size of this neighborhood depends on the agents' learning rates, the exploration parameters, and the number of consensus steps used to calculate the local estimates of the global accumulated rewards. Moreover, we provide numerical experiments demonstrating that our new zeroth-order policy gradient estimator is more sample-efficient than other existing one-point estimators.
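To make the two ingredients named above concrete, the following is a minimal sketch on a toy problem, not the algorithm analyzed in the paper. It assumes Gaussian perturbation directions, a hypothetical quadratic objective `toy_objective` standing in for the accumulated reward, and a hand-picked doubly stochastic mixing matrix; all function and class names are illustrative. The residual-feedback estimator is written in the form commonly used for this technique, where the single new function value is differenced against the value queried at the previous step.

```python
import numpy as np


def toy_objective(x):
    # Hypothetical stand-in for an accumulated reward (assumption, not from the paper).
    # The constant offset is what makes the plain one-point estimate noisy.
    return 0.5 * float(x @ x) + 10.0


def one_point_estimate(f, x, delta, rng):
    """Conventional one-point estimate: (f(x + delta*u) / delta) * u."""
    u = rng.standard_normal(x.shape)
    return f(x + delta * u) / delta * u


class ResidualFeedbackEstimator:
    """One-point residual feedback: one new query per step, differenced
    against the value queried at the previous step."""

    def __init__(self, delta, rng):
        self.delta = delta
        self.rng = rng
        self.prev_value = None

    def estimate(self, f, x):
        u = self.rng.standard_normal(x.shape)
        value = f(x + self.delta * u)  # the only new query this step
        # On the first call there is no history, so the estimate is zero.
        prev = value if self.prev_value is None else self.prev_value
        self.prev_value = value
        return (value - prev) / self.delta * u


def consensus_average(local_values, W, num_steps):
    """Run num_steps of linear consensus z <- W z.
    With a doubly stochastic W on a connected graph, every entry
    approaches the average of the initial local values."""
    z = np.asarray(local_values, dtype=float)
    for _ in range(num_steps):
        z = W @ z
    return z


def compare_variance(n_samples=4000, delta=0.1, seed=0):
    """Empirical total variance of both estimators at a fixed point."""
    rng = np.random.default_rng(seed)
    x = np.ones(5)
    one_point = np.array(
        [one_point_estimate(toy_objective, x, delta, rng) for _ in range(n_samples)]
    )
    est = ResidualFeedbackEstimator(delta, rng)
    residual = np.array([est.estimate(toy_objective, x) for _ in range(n_samples)])
    return (
        one_point.var(axis=0).sum(),
        residual.var(axis=0).sum(),
        residual.mean(axis=0),
    )
```

Calling `compare_variance()` returns the total variance of each estimator at a fixed point, together with the sample mean of the residual-feedback estimates, which should lie near the true gradient of the toy objective. Calling `consensus_average` with more steps brings each agent's local value closer to the network-wide average, which mirrors the role the number of consensus steps plays in the size of the convergence neighborhood.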