Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and therefore provides no all-time safety assurance. Moreover, these methods often overlook the feasibility issue (from certain regions of the constraint set, the system will inevitably violate the state constraints), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with $\textit{state-wise}$ constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose $\textit{Multi-Agent Dual Actor-Critic}$ (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.
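The feasibility issue and the role of the CIS can be illustrated with a minimal toy sketch. This is an assumption for illustration only, not the paper's algorithm: a tabular fixed-point iteration for a safety value function $V(s) = \min\big(h(s), \max_a V(f(s,a))\big)$, whose nonnegative super-level set $\{s : V(s) \ge 0\}$ serves as the controlled invariant set. The chain MDP, the constraint function `h`, and all names below are made up for the example.

```python
# Toy sketch (assumed setup, not the paper's method): identify the CIS of a
# 5-state deterministic chain by iterating the safety value function
#   V(s) = min( h(s), max_a V(f(s, a)) ).
# h(s) < 0 marks a state-constraint violation.
h = [-1.0, 1.0, 1.0, 1.0, 1.0]   # state 0 violates the constraint

def step(s, a):                  # deterministic dynamics, actions a in {-1, +1}
    if s == 1:
        return 0                 # state 1 is a trap: violation is inevitable
    return max(1, min(4, s + a))

V = list(h)
for _ in range(100):             # fixed-point (value) iteration
    V = [min(h[s], max(V[step(s, -1)], V[step(s, +1)]))
         for s in range(5)]

cis = [s for s in range(5) if V[s] >= 0.0]
print(cis)                       # states from which safety can be kept forever
```

Note that state 1 satisfies the constraint ($h(1) > 0$) yet lies outside the computed CIS, since every action leads to a violation; this is exactly the feasibility issue the abstract refers to, which discounted cumulative-cost constraints do not capture.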