Topological Guided Actor-Critic Modular Learning of Continuous Systems with Temporal Objectives

This work investigates the formal policy synthesis of continuous-state stochastic dynamic systems given high-level specifications in linear temporal logic. To learn an optimal policy that maximizes the satisfaction probability, we take a product between a dynamic system and the translated automaton to construct a product system on which we solve an optimal planning problem. Since this product system has a hybrid product state space that results in reward sparsity, we introduce a generalized optimal backup order, in reverse to the topological order, to guide the value backups and accelerate the learning process. We provide the optimality proof for using the generalized optimal backup order in this optimal planning problem. Further, this paper presents an actor-critic reinforcement learning algorithm when topological order applies. This algorithm leverages advanced mathematical techniques and enjoys the property of hyperparameter self-tuning. We provide proof of the optimality and convergence of our proposed reinforcement learning algorithm. We use neural networks to approximate the value function and policy function for hybrid product state space. Furthermore, we observe that assigning integer numbers to automaton states can rank the value or policy function approximated by neural networks. To break the ordinal relationship, we use an individual neural network for each automaton state's value (policy) function, termed modular learning. We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole. Second, we demonstrate the empirical performance of our formal policy synthesis framework on motion planning of a Dubins car with a temporal specification.

翻译：本文研究给定线性时序逻辑高层规范下连续状态随机动态系统的形式化策略综合问题。为学习最大化满足概率的最优策略，我们将动态系统与转换后的自动机构建乘积系统，并在该乘积系统上求解最优规划问题。由于乘积系统的混合乘积状态空间会导致奖励稀疏性，我们引入与拓扑序相反的广义最优回溯顺序来引导值函数回溯并加速学习过程，同时给出了在该最优规划问题中使用广义最优回溯顺序的最优性证明。此外，本文提出一种基于拓扑序的演员-评论家强化学习算法，该算法利用先进数学技术并具备超参数自适应调节特性。我们给出了所提强化学习算法最优性与收敛性的证明，并采用神经网络逼近混合乘积状态空间的值函数与策略函数。进一步研究发现，为自动机状态分配整数值可对神经网络逼近的值函数或策略函数进行排序。为打破这种序关系，我们为每个自动机状态的值函数（策略函数）分别采用独立神经网络进行模块化学习。我们开展了两项实验：首先在经典控制任务CartPole上通过与基线方法对比验证强化学习算法的有效性；其次在Dubins小车的时序规范运动规划任务中展示形式化策略综合框架的实证性能。