Vanilla reinforcement learning (RL) can efficiently solve complex tasks but provides no guarantees on system behavior. Yet real-world systems are often safety-critical, so guarantees on their safety specifications are necessary. To bridge this gap, we propose a safe RL procedure for continuous action spaces with verified probabilistic guarantees specified via temporal logic. First, our approach probabilistically verifies a candidate controller with respect to a temporal logic specification while randomizing the controller's inputs within an expansion set. Then, we use RL to improve the performance of this probabilistically verified controller, restricting exploration to the given expansion set around the controller's input. Finally, we calculate probabilistic safety guarantees with respect to the temporal logic specifications for the learned agent. Our approach can be implemented efficiently for continuous action and state spaces and separates safety verification and performance improvement into two distinct steps. We evaluate our approach on an evasion task in which a robot has to reach a goal while evading a dynamic obstacle with a specific maneuver. Our results show that our safe RL approach leads to efficient learning while probabilistically maintaining the safety specifications.
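The verify-then-explore idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional system `x_{t+1} = x_t + u_t`, the proportional `base_controller`, the finite-trace specification "always `|x| <= limit` and eventually within `tol` of the goal", and the one-sided Hoeffding bound that stands in for whatever probabilistic verification method the authors actually use. The function `project_to_expansion_set` shows one simple way an RL agent's action could be kept inside the expansion set around the controller's input.

```python
import math
import random


def base_controller(x, goal=1.0, gain=0.5):
    """Hypothetical candidate controller: proportional step toward the goal."""
    return gain * (goal - x)


def rollout(policy, steps=30, x0=0.0):
    """Simulate the toy system x_{t+1} = x_t + u_t and return all visited states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + policy(xs[-1]))
    return xs


def satisfies_spec(xs, limit=1.5, goal=1.0, tol=0.1):
    """Finite-trace check of: always |x| <= limit, and eventually |x - goal| <= tol."""
    always_safe = all(abs(x) <= limit for x in xs)
    eventually_goal = any(abs(x - goal) <= tol for x in xs)
    return always_safe and eventually_goal


def randomized_policy(rng, eps):
    """Base controller input plus a uniform perturbation drawn from [-eps, eps]."""
    return lambda x: base_controller(x) + rng.uniform(-eps, eps)


def verify(eps, n_samples=2000, delta=0.01, seed=0):
    """Monte Carlo estimate of the satisfaction probability under randomized inputs.

    Returns (p_hat, lower), where lower is a one-sided Hoeffding bound that
    holds with confidence 1 - delta (an assumed stand-in for the paper's method).
    """
    rng = random.Random(seed)
    successes = sum(
        satisfies_spec(rollout(randomized_policy(rng, eps)))
        for _ in range(n_samples)
    )
    p_hat = successes / n_samples
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))
    return p_hat, max(0.0, p_hat - margin)


def project_to_expansion_set(x, proposed_u, eps):
    """Clip a learned action so it stays within eps of the base controller's input."""
    u0 = base_controller(x)
    return min(max(proposed_u, u0 - eps), u0 + eps)


if __name__ == "__main__":
    p_hat, lower = verify(eps=0.04)
    print(f"estimated satisfaction probability: {p_hat:.3f}, lower bound: {lower:.3f}")
```

In this toy setting, the guarantee computed for perturbations in `[-eps, eps]` carries over to any learned policy whose actions are first passed through `project_to_expansion_set`, only insofar as the learned actions behave like the sampled perturbations; the abstract therefore adds a final verification step for the learned agent, which this sketch does not reproduce.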