Learning Optimal Strategies for Temporal Tasks in Stochastic Games

Synthesis from linear temporal logic (LTL) specifications provides assured controllers for systems operating in stochastic and potentially adversarial environments. Automatic synthesis tools, however, require a model of the environment to construct controllers. In this work, we introduce a model-free reinforcement learning (RL) approach to derive controllers from given LTL specifications even when the environment is completely unknown. We model the problem as a stochastic game (SG) between the controller and the adversarial environment; we then learn optimal control strategies that maximize the probability of satisfying the LTL specifications against the worst-case environment behavior. We first construct a product game using the deterministic parity automaton (DPA) translated from the given LTL specification. By deriving distinct rewards and discount factors from the acceptance condition of the DPA, we reduce the maximization of the worst-case probability of satisfying the LTL specification into the maximization of a discounted reward objective in the product game; this enables the use of model-free RL algorithms to learn an optimal controller strategy. To deal with the common scalability problems when the number of sets defining the acceptance condition of the DPA (usually referred as colors), is large, we propose a lazy color generation method where distinct rewards and discount factors are utilized only when needed, and an approximate method where the controller eventually focuses on only one color. In several case studies, we show that our approach is scalable to a wide range of LTL formulas, significantly outperforming existing methods for learning controllers from LTL specifications in SGs.

翻译：从线性时序逻辑（LTL）规范进行综合，可为运行于随机且可能对抗性环境中的系统提供有保障的控制器。然而，自动综合工具需要环境模型来构建控制器。本文提出一种无模型强化学习（RL）方法，即使环境完全未知，也能从给定LTL规范推导出控制器。我们将问题建模为控制器与对抗性环境之间的随机博弈（SG），并学习最优控制策略，以最大化在最坏环境行为下满足LTL规范的概率。首先，我们利用从给定LTL规范转换得到的确定性奇偶自动机（DPA）构建乘积博弈。通过从DPA的接受条件中推导出不同的奖励和折扣因子，我们将最大化满足LTL规范的最坏情况概率问题转化为乘积博弈中折扣奖励目标的最大化问题，从而能够使用无模型RL算法学习最优控制器策略。针对DPA接受条件定义集合（通常称为颜色）数量较大时常见的规模扩展问题，我们提出一种惰性颜色生成方法（仅在需要时使用不同的奖励和折扣因子）以及一种近似方法（控制器最终仅聚焦于单一颜色）。在多个案例研究中，我们证明该方法可扩展至广泛的LTL公式，显著优于SG中从LTL规范学习控制器的现有方法。