Reinforcement Learning (RL) is a widely employed machine learning paradigm that has been applied to a variety of control problems. However, applications in safety-critical domains require a systematic and formal approach to specifying requirements as tasks or goals. We propose a model-free RL algorithm that enables the use of Linear Temporal Logic (LTL) to formulate a goal for unknown continuous-state/action Markov Decision Processes (MDPs). The given LTL property is translated into a Limit-Deterministic Generalised Büchi Automaton (LDGBA), which is then used to shape a synchronous reward function on the fly. Under certain assumptions, the algorithm is guaranteed to synthesise a control policy whose traces satisfy the LTL specification with maximal probability.
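To make the idea of an automaton-driven reward concrete, the following is a minimal sketch under several assumptions that are not taken from the abstract. It hard-codes a hypothetical deterministic generalised Büchi automaton for the property "GF a ∧ GF b" (visit `a` infinitely often and `b` infinitely often). A deterministic automaton is trivially limit-deterministic. The automaton steps in lockstep with the environment on the label of each visited state. A simplified "frontier" of accepting sets not yet visited gives a positive reward when a pending set is reached and resets once all sets have been seen. The names `delta`, `ACCEPTING_SETS` and `FrontierReward`, and the reset rule, are illustrative choices rather than the authors' construction. In a full agent, the pair (MDP state, automaton state `q`) would be the input to the RL learner.

```python
from typing import FrozenSet, Iterable, List, Set

State = FrozenSet[str]

# Hypothetical deterministic generalised Büchi automaton for GF a & GF b:
# each automaton state records which of the atoms {a, b} held at the last step.
ATOMS: FrozenSet[str] = frozenset({"a", "b"})
ACCEPTING_SETS: List[Set[State]] = [
    {frozenset({"a"}), frozenset({"a", "b"})},  # F1: states where 'a' held
    {frozenset({"b"}), frozenset({"a", "b"})},  # F2: states where 'b' held
]


def delta(q: State, label: FrozenSet[str]) -> State:
    """Transition function; here the successor depends only on the label."""
    return frozenset(label & ATOMS)


class FrontierReward:
    """Simplified on-the-fly reward: pay out when a pending accepting set is hit."""

    def __init__(self, accepting_sets: List[Set[State]], reward: float = 1.0):
        self.accepting_sets = accepting_sets
        self.reward = reward
        self.reset()

    def reset(self) -> None:
        self.q: State = frozenset()  # initial automaton state
        # Indices of accepting sets not yet visited in the current "round".
        self.frontier: List[int] = list(range(len(self.accepting_sets)))

    def step(self, label: Iterable[str]) -> float:
        # Advance the automaton synchronously with the environment step.
        self.q = delta(self.q, frozenset(label))
        hit = [i for i in self.frontier if self.q in self.accepting_sets[i]]
        if not hit:
            return 0.0
        # Remove the sets just visited; start a new round once all are covered.
        self.frontier = [i for i in self.frontier if i not in hit]
        if not self.frontier:
            self.frontier = list(range(len(self.accepting_sets)))
        return self.reward


shaper = FrontierReward(ACCEPTING_SETS)
trace = [{"a"}, {"a"}, {"b"}, set(), {"a", "b"}, {"a"}]
rewards = [shaper.step(lbl) for lbl in trace]
print(rewards)  # [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
```

Revisiting an accepting set that has already been collected in the current round earns nothing. A policy maximising this reward is therefore pushed to cycle through every accepting set rather than loop on one of them. This is the intuition behind rewarding generalised Büchi acceptance.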