Deep reinforcement learning has emerged as a powerful tool for obtaining high-performance policies. However, the safety of these policies has been a long-standing issue. One promising paradigm for guaranteeing safety is shielding, which prevents a policy from taking unsafe actions. However, the cost of computing a shield grows exponentially with the number of state variables, which is a particular concern in multi-agent systems with many agents. In this work, we propose a novel approach to multi-agent shielding. We address scalability by computing an individual shield for each agent. The challenge is that typical safety specifications are global properties, whereas the shields of individual agents only ensure local properties. Our key insight for overcoming this challenge is to apply assume-guarantee reasoning. Specifically, we present a sound proof rule that decomposes a (global, complex) safety specification into (local, simple) obligations for the shields of the individual agents. Moreover, we show that applying the shields during reinforcement learning significantly improves the quality of the policies obtained for a given training budget. We demonstrate the effectiveness and scalability of our multi-agent shielding framework in two case studies, reducing computation time from hours to seconds and achieving fast learning convergence.
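The runtime mechanism described above can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration, not the paper's implementation: the names `LocalShield` and `shielded_step` and the dictionary-based shield representation are our own assumptions. Each agent carries a precomputed shield over its local state; the shield checks the action proposed by the agent's policy and substitutes a safe fallback action when necessary. In the paper's framework, assume-guarantee reasoning is what justifies that these purely local checks together imply the global safety specification.

```python
# Minimal sketch of per-agent runtime shielding (hypothetical API; the
# paper's concrete shield construction is not reproduced here).

from typing import Callable, Dict, FrozenSet, Hashable

State = Hashable
Action = Hashable


class LocalShield:
    """Per-agent shield: allowed_actions[s] is the set of actions that
    preserve the agent's local safety obligation in local state s."""

    def __init__(self, allowed_actions: Dict[State, FrozenSet[Action]]):
        self.allowed_actions = allowed_actions

    def filter(self, state: State, proposed: Action,
               fallback: Callable[[State], Action]) -> Action:
        safe = self.allowed_actions.get(state, frozenset())
        if proposed in safe:
            return proposed      # the policy's choice is already safe
        return fallback(state)   # override with a safe fallback action


def shielded_step(agents, policies, shields, local_states, fallbacks):
    """One synchronous step: every agent proposes an action via its own
    policy, then its individual shield corrects the action if needed."""
    actions = {}
    for i in agents:
        proposed = policies[i](local_states[i])
        actions[i] = shields[i].filter(local_states[i], proposed,
                                       fallbacks[i])
    return actions
```

Note that each shield in this sketch is defined over an agent's local state rather than the joint state space; keeping the shields local in this way is what avoids the exponential blow-up mentioned above, at the price of only ensuring local properties, which is exactly the gap the paper's assume-guarantee proof rule closes.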