Safe multi-agent reinforcement learning remains relatively unexplored despite its potential applications in domains such as drone delivery and vehicle automation. Training agents to learn optimal policies that maximize rewards while satisfying specific constraints is challenging, particularly when a central controller cannot be used to coordinate the agents during training. In this paper, we address multi-agent policy optimization in a decentralized setting, where agents communicate with their neighbors to maximize the sum of their cumulative rewards while satisfying each agent's safety constraints. We consider both peak and average constraints. No central controller coordinates the agents, and each agent's rewards and constraints are known only to that agent. We formulate the problem as a decentralized constrained multi-agent Markov Decision Problem and propose a momentum-based decentralized policy gradient method, DePAint, to solve it. To the best of our knowledge, this is the first privacy-preserving, fully decentralized multi-agent reinforcement learning algorithm that considers both peak and average constraints. We also provide a theoretical analysis and an empirical evaluation of our algorithm in various scenarios, and compare its performance to centralized algorithms that consider similar constraints.