We study scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring any agent to fully observe the team. By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow rewards and localized policies that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. Our algorithm converges, with high probability, to $\epsilon$-stationarity with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ samples, up to an approximation error that decreases exponentially in the communication radius. This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
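As a rough illustration of step (1), the sketch below assumes the usual convention in the general-utility literature that the shadow reward is the gradient of the utility with respect to the occupancy measure. It is a single-agent, tabular toy, not the distributed estimator of this work: the function names, the entropy utility, and the normalization by total discounted weight are all illustrative assumptions.

```python
import numpy as np

def empirical_occupancy(trajectory, n_states, n_actions, gamma):
    """Discounted state-action visitation from one trajectory.

    Assumption: we normalize by the total discounted weight so the
    entries sum to 1 even for a finite trajectory.
    """
    lam = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(trajectory):
        lam[s, a] += gamma ** t
    return lam / lam.sum()

def entropy_utility(lam, eps=1e-8):
    """Hypothetical nonlinear utility: entropy of the occupancy measure."""
    return -np.sum(lam * np.log(lam + eps))

def shadow_reward(lam, eps=1e-8):
    """Gradient of entropy_utility with respect to lam, entry by entry."""
    return -(np.log(lam + eps) + lam / (lam + eps))

# Toy trajectory of (state, action) pairs over 3 states and 2 actions.
traj = [(0, 1), (1, 0), (0, 1), (2, 1)]
lam_hat = empirical_occupancy(traj, n_states=3, n_actions=2, gamma=0.9)
r_shadow = shadow_reward(lam_hat)
```

With such a plug-in estimate, the shadow reward can be treated like an ordinary reward table in the subsequent Q-function and policy gradient steps; rarely visited state-action pairs receive larger shadow rewards under this particular entropy utility.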