This paper studies a class of multi-agent reinforcement learning (MARL) problems in which the reward an agent receives depends on the states of other agents, but its next state depends only on its own current state and action. We call this class REC-MARL, for REward-Coupled Multi-Agent Reinforcement Learning. REC-MARL has a range of important applications, such as real-time access control and distributed power control in wireless networks. This paper presents a distributed policy gradient algorithm for REC-MARL. The proposed algorithm is distributed in two respects: (i) the learned policy is a distributed policy that maps an agent's local state to its local action, and (ii) the learning/training is distributed, with each agent updating its policy based on its own and its neighbors' information. The proposed algorithm converges to a stationary policy, and its iteration complexity bounds depend on the dimension of the local states and actions. Experiments on real-time access control and power control in wireless networks show that the learned policy significantly outperforms state-of-the-art algorithms and well-known benchmarks.
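To make the REC-MARL structure concrete, the following is a minimal sketch of a toy instance and a REINFORCE-style local update. It is not the algorithm proposed in this paper. Everything in it is a hypothetical choice made for illustration: the state and action sizes, the transition rule `local_step`, the reward `coupled_reward`, the tabular softmax `policy`, the step size, and the symmetric neighbor graph. It is meant to show only the two properties described above: each agent's transition uses its own state and action, and each agent's update uses only its own and its neighbors' rewards.

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2  # toy sizes, assumed for illustration only


def local_step(state, action, noise):
    # Local transition: uses only this agent's own state and action.
    return (state + action + noise) % N_STATES


def coupled_reward(i, states, actions, neighbors):
    # Coupled reward: depends on the states of agent i's neighbors.
    clashes = sum(int(states[j] == states[i]) for j in neighbors[i])
    return float(actions[i]) - 0.5 * clashes


def policy(theta_i, state):
    # Local softmax policy: maps the agent's own state to action probabilities.
    logits = theta_i[state]
    z = np.exp(logits - logits.max())
    return z / z.sum()


def run_episode(theta, neighbors, horizon, rng):
    n = len(theta)
    states = rng.integers(0, N_STATES, size=n)
    traj = []
    for _ in range(horizon):
        actions = np.array(
            [rng.choice(N_ACTIONS, p=policy(theta[i], states[i])) for i in range(n)]
        )
        rewards = np.array(
            [coupled_reward(i, states, actions, neighbors) for i in range(n)]
        )
        traj.append((states.copy(), actions, rewards))
        noise = rng.integers(0, 2, size=n)
        states = np.array(
            [local_step(states[i], actions[i], noise[i]) for i in range(n)]
        )
    return traj


def local_gradient(i, theta_i, traj, neighbors, gamma=0.9):
    # REINFORCE-style estimate that uses only the rewards of agent i and its
    # neighbors (assumes a symmetric neighbor graph).
    group = [i] + list(neighbors[i])
    grad = np.zeros_like(theta_i)
    ret = 0.0
    for states, actions, rewards in reversed(traj):
        ret = rewards[group].sum() + gamma * ret  # discounted reward-to-go
        s, a = states[i], actions[i]
        score = -policy(theta_i, s)  # gradient of log-softmax w.r.t. logits
        score[a] += 1.0
        grad[s] += score * ret
    return grad


def train(neighbors, episodes=50, horizon=10, step_size=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(neighbors)
    theta = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(n)]
    for _ in range(episodes):
        traj = run_episode(theta, neighbors, horizon, rng)
        grads = [local_gradient(i, theta[i], traj, neighbors) for i in range(n)]
        for i in range(n):
            theta[i] += step_size * grads[i]  # each agent updates only its own parameters
    return theta
```

For example, `train({0: [1], 1: [0, 2], 2: [1]})` runs the sketch on a three-agent line graph. Each agent's parameter table has `N_STATES * N_ACTIONS` entries regardless of the number of agents, which is the sense in which the policy stays local in this toy setup.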