An important challenge for enabling the deployment of reinforcement learning (RL) algorithms in the real world is safety. This has given rise to the recent research field of Safe RL, which aims to learn policies that are both optimal and safe. One successful approach in this direction is probabilistic logic shields (PLS), a model-based Safe RL technique that uses formal specifications based on probabilistic logic programming to constrain an agent's policy so that it complies with those specifications in a probabilistic sense. However, safety is inherently a multi-agent concept: real-world environments often involve multiple agents interacting simultaneously, leading to a complex system that is hard to control. Moreover, safe multi-agent RL (Safe MARL) is still underexplored. To address this gap, in this paper we ($i$) introduce Shielded MARL (SMARL) by extending PLS to MARL; in particular, we introduce Probabilistic Logic Temporal Difference Learning (PLTD) to enable shielded independent Q-learning (SIQL), and introduce shielded independent PPO (SIPPO) using probabilistic logic policy gradients; ($ii$) demonstrate its positive effect, and its use as an equilibrium selection mechanism, on safety, cooperation, and alignment with normative behaviors in various game-theoretic environments, including two-player simultaneous games, extensive-form games, stochastic games, and some grid-world extensions; and ($iii$) study the asymmetric case where only one agent is shielded, showing that the shielded agent significantly influences the unshielded one, which provides further evidence of SMARL's ability to enhance safety and cooperation in diverse multi-agent environments.
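To make the shielding idea concrete: PLS constrains a policy probabilistically by reweighting each action's probability with an estimate of that action's safety, typically obtained from a probabilistic logic program over the current state. The following minimal Python sketch is our own illustration of that reweighting step, not the authors' implementation; the function name `shield_policy` and the fallback when no action is deemed safe are assumptions made for this example.

```python
import numpy as np

def shield_policy(action_probs: np.ndarray, safety_probs: np.ndarray) -> np.ndarray:
    """Reweight a base policy by per-action safety probabilities.

    action_probs : base policy pi(a|s) over a discrete action set.
    safety_probs : P(safe | s, a) for each action, e.g. inferred by a
                   probabilistic logic program from the current state.
    Returns the shielded policy, proportional to pi(a|s) * P(safe|s,a),
    renormalized to sum to one.
    """
    weighted = action_probs * safety_probs
    total = weighted.sum()
    if total == 0.0:
        # Assumed fallback: if no action has positive safety probability,
        # keep the base policy rather than return an invalid distribution.
        return action_probs
    return weighted / total

# Example: a 3-action policy where action 2 is likely unsafe.
pi = np.array([0.2, 0.3, 0.5])
p_safe = np.array([0.9, 0.8, 0.1])
print(shield_policy(pi, p_safe))  # probability mass shifts toward the safer actions
```

In a multi-agent setting of the kind the paper studies, each shielded agent would apply such a reweighting independently to its own policy before acting, which is what makes the mechanism compatible with independent learners such as SIQL and SIPPO.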