We study how to synthesize robust and safe policies for autonomous systems under signal temporal logic (STL) tasks in adversarial settings against unknown dynamic agents. To ensure worst-case STL satisfaction, we propose STLGame, a framework that models the multi-agent system as a two-player zero-sum game in which the ego agents maximize STL satisfaction and the other agents minimize it. Using the fictitious self-play (FSP) framework, STLGame seeks a Nash equilibrium policy profile, which offers the best worst-case robustness against unseen opponent policies. FSP iteratively converges to a Nash profile, even in games with continuous state-action spaces. To approximate the best response at each FSP iteration, we propose a gradient-based method built on differentiable STL formulas, which is crucial in continuous settings; we demonstrate this experimentally by comparing against reinforcement learning-based best-response methods. Experiments on two standard dynamical-system benchmarks, Ackermann-steering vehicles and autonomous drones, show that the converged policy is almost unexploitable and robust to a range of unseen opponent policies. All code and additional experimental results can be found on our project website: https://sites.google.com/view/stlgame
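The gradient-based best-response step relies on differentiable STL robustness. As a minimal sketch of the standard idea (not the paper's implementation): the robustness of "always x > c" is the minimum margin over the trajectory, and replacing the hard `min` with a log-sum-exp softmin makes it differentiable. The function names and the temperature `beta` below are illustrative assumptions.

```python
import math

def softmin(vals, beta=50.0):
    """Smooth, differentiable approximation of min(vals) via log-sum-exp.

    Approaches the true minimum as beta -> infinity, which enables
    gradient-based optimization of STL robustness.
    """
    m = min(vals)  # shift values for numerical stability
    return m - (1.0 / beta) * math.log(sum(math.exp(-beta * (v - m)) for v in vals))

def robustness_always_above(traj, threshold, beta=50.0):
    # Robustness of G (x_t > threshold): the worst-case margin over the
    # trajectory, smoothed so it is differentiable in the trajectory.
    return softmin([x - threshold for x in traj], beta)

traj = [1.0, 0.5, 0.8]
rho = robustness_always_above(traj, 0.2)  # true minimum margin is 0.3
```

The softmin is a lower bound on the true minimum, so a positive smooth robustness still certifies satisfaction of the formula.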
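STLGame applies fictitious self-play in continuous state-action spaces; as a toy illustration of why averaging over iterated best responses converges to a Nash equilibrium in zero-sum games (a stand-in example, not the paper's algorithm), here is classic fictitious play on matching pennies:

```python
# Fictitious play on matching pennies: each player best-responds to the
# opponent's empirical action frequencies; in zero-sum games the empirical
# averages converge to a Nash equilibrium (here, the uniform mix).
A = [[1, -1], [-1, 1]]  # row player's payoff matrix (zero-sum)

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

counts_row, counts_col = [1, 0], [0, 1]  # arbitrary initial actions
for _ in range(10000):
    avg_row, avg_col = normalize(counts_row), normalize(counts_col)
    # Row player maximizes expected payoff against the column average.
    row_payoffs = [sum(A[i][j] * avg_col[j] for j in range(2)) for i in range(2)]
    counts_row[row_payoffs.index(max(row_payoffs))] += 1
    # Column player minimizes the row player's payoff (zero-sum).
    col_payoffs = [sum(-A[i][j] * avg_row[i] for i in range(2)) for j in range(2)]
    counts_col[col_payoffs.index(max(col_payoffs))] += 1

avg_row, avg_col = normalize(counts_row), normalize(counts_col)
# Both empirical strategies approach the Nash mix (0.5, 0.5).
```

In the continuous setting of the paper, the tabular best responses above are replaced by policies trained via gradients through the differentiable STL robustness.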