We study scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring full observability of each agent in the team. By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. With high probability, our algorithm converges to $\epsilon$-stationarity using $\widetilde{\mathcal{O}}(\epsilon^{-2})$ samples, up to an approximation error that decreases exponentially in the communication radius. This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
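To make step (1) concrete, the following is a minimal single-agent, tabular sketch rather than the paper's algorithm. It assumes the convention common in the general-utilities literature that the shadow reward is the gradient of the utility with respect to the occupancy measure; the abstract itself does not define it. The discounted, normalized occupancy estimate, the entropy utility, and all function names (`empirical_occupancy`, `entropy_utility`, `shadow_reward`) are illustrative assumptions. The truncated shadow Q-function and policy gradient steps are not shown.

```python
import math
from collections import defaultdict


def empirical_occupancy(trajectory, gamma):
    """Normalized discounted visit frequencies of (state, action) pairs.

    Assumption: a discounted occupancy measure estimated from one
    finite trajectory; the weights are normalized to sum to 1.
    """
    occ = defaultdict(float)
    weight = 1.0
    total = 0.0
    for state, action in trajectory:
        occ[(state, action)] += weight
        total += weight
        weight *= gamma
    return {sa: w / total for sa, w in occ.items()}


def entropy_utility(occ):
    """Example nonlinear utility (an assumption): f(lam) = -sum lam * log(lam)."""
    return -sum(p * math.log(p) for p in occ.values() if p > 0.0)


def shadow_reward(occ):
    """Gradient of the entropy utility: df/dlam(s, a) = -log(lam(s, a)) - 1."""
    return {sa: -math.log(p) - 1.0 for sa, p in occ.items() if p > 0.0}


# Hypothetical toy trajectory of (state, action) pairs.
trajectory = [(0, "a"), (1, "b"), (0, "a"), (1, "a")]
occ = empirical_occupancy(trajectory, gamma=0.9)
rewards = shadow_reward(occ)
```

Under this entropy utility, rarely visited state-action pairs receive a larger shadow reward, so a policy gradient step driven by `rewards` would push the agent toward broader coverage.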