Safe coordination problems arise in multi-agent reinforcement learning when no agent can enforce global safety unilaterally: whether one agent's action is admissible may depend on the dynamics of the other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude optimal team behaviour that is safe only through coordination. We study deterministic safety guarantees for agents trained and deployed under decentralised execution, recovering team-optimal safe behaviour without centralised runtime control. Agents share a global specification $φ$ in the safety fragment of Linear Temporal Logic ($\mathsf{LTL}_{\mathsf{safe}}$) and select among contract tuples of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations whose conjunction implies $φ$. Each agent may rely on the other agents' local obligations as assumptions because the whole contract tuple is certified jointly and can be projected onto local action masks. At learning time, a non-stationary multi-armed bandit selects, from a library of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations, the tuple that optimises team reward, without forgoing end-to-end safety. We evaluate the approach across 6 environments and 15 algorithmic variants.
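The selection loop described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that each contract tuple has already been certified offline to imply $φ$, and that each local obligation has been projected to a stateless predicate over actions. The class `DiscountedUCB`, the helper `action_mask`, the one-lane-bridge contracts, and the values of `gamma` and `xi` are all hypothetical; a discounted-UCB-style index stands in for whichever non-stationary bandit is actually used.

```python
import math
from typing import Callable, List, Sequence

Action = str
# Hypothetical stand-in: a local obligation already projected to an action predicate.
Obligation = Callable[[Action], bool]


def action_mask(obligation: Obligation, actions: Sequence[Action]) -> List[Action]:
    """Return the actions that the local obligation permits."""
    return [a for a in actions if obligation(a)]


class DiscountedUCB:
    """Discounted-UCB-style bandit: old rewards decay so estimates can track drift."""

    def __init__(self, n_arms: int, gamma: float = 0.95, xi: float = 0.5) -> None:
        self.gamma = gamma            # discount factor (assumed value)
        self.xi = xi                  # exploration scale (assumed value)
        self.sums = [0.0] * n_arms    # discounted reward sums per arm
        self.counts = [0.0] * n_arms  # discounted pull counts per arm

    def select(self) -> int:
        # Pull every arm once before using the index.
        for arm, n in enumerate(self.counts):
            if n == 0.0:
                return arm
        total = sum(self.counts)

        def index(arm: int) -> float:
            mean = self.sums[arm] / self.counts[arm]
            bonus = math.sqrt(self.xi * math.log(max(total, 1.0)) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=index)

    def update(self, arm: int, reward: float) -> None:
        # Decay all statistics, then credit the arm that was pulled.
        self.sums = [self.gamma * s for s in self.sums]
        self.counts = [self.gamma * n for n in self.counts]
        self.sums[arm] += reward
        self.counts[arm] += 1.0


ACTIONS = ["wait", "enter"]
# Two hypothetical contract tuples for a one-lane bridge; in each tuple at most
# one agent may enter, so either tuple keeps the (assumed) global property.
CONTRACTS = [
    (lambda a: True, lambda a: a == "wait"),  # agent 0 may cross, agent 1 yields
    (lambda a: a == "wait", lambda a: True),  # agent 1 may cross, agent 0 yields
]

bandit = DiscountedUCB(n_arms=len(CONTRACTS))
arm = bandit.select()                                        # choose a contract tuple
masks = [action_mask(ob, ACTIONS) for ob in CONTRACTS[arm]]  # one mask per agent
bandit.update(arm, reward=1.0)                               # feed back team reward
```

Because the bandit only ever chooses among pre-certified tuples, exploration changes which safe behaviour the team adopts, never whether the behaviour is safe; that separation is the point of the sketch.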