This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent's reward depends on its own state-action pair as well as those of its direct neighbors, and each agent's policy is parameterized by its local parameters together with those of its $\kappa_{p}$-hop neighbors, where $\kappa_{p}\geq 1$ denotes the coupled radius. The agents aim to collaboratively optimize their policies so as to maximize the discounted average cumulative reward. To address the challenge that interdependent policies pose for collaborative optimization, we introduce a novel concept termed the neighbors' averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, in which each agent relies only on the state-action pairs of its $\kappa_{p}$-hop neighbors and the rewards of its $(\kappa_{p}+1)$-hop neighbors. Specifically, the DSCP algorithm employs a geometric 2-horizon sampling method that yields an unbiased estimate of the coupled policy gradient without storing a full $Q$-table. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of the other agents' parameters so that it can execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of the DSCP algorithm is demonstrated through simulations in a robot path-planning environment, showing clear improvements over state-of-the-art methods.
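The push-sum protocol invoked for coordinating the parameter updates is a standard distributed-averaging primitive over directed graphs. The sketch below illustrates only this generic primitive, not the paper's DSCP parameter update; the graph, round count, and names such as `push_sum_average` and `out_neighbors` are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the generic push-sum averaging primitive that the
# DSCP-style coordination builds on; names and the example graph are
# illustrative assumptions, not the paper's exact update rule.

def push_sum_average(values, out_neighbors, num_rounds=100):
    """Each node i keeps a numerator x_i and a weight w_i; every round it
    splits (x_i, w_i) equally among itself and its out-neighbors. On a
    strongly connected directed graph, x_i / w_i converges to the average
    of the initial values."""
    n = len(values)
    x = np.array(values, dtype=float)   # numerators, initialized to local values
    w = np.ones(n)                      # weights, initialized to 1
    for _ in range(num_rounds):
        new_x = np.zeros(n)
        new_w = np.zeros(n)
        for i in range(n):
            targets = [i] + list(out_neighbors[i])   # a node always sends to itself
            share = 1.0 / len(targets)               # column-stochastic split
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return x / w   # each entry approaches mean(values)

# Usage: a 4-node directed ring; every node's estimate tends to the mean 2.5.
if __name__ == "__main__":
    ring = {0: [1], 1: [2], 2: [3], 3: [0]}
    print(push_sum_average([1.0, 2.0, 3.0, 4.0], ring))
```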