Vanilla reinforcement learning (RL) can efficiently solve complex tasks but provides no guarantees on system behavior. To bridge this gap, we propose a three-step safe RL procedure for continuous action spaces that provides probabilistic guarantees with respect to temporal logic specifications. First, our approach probabilistically verifies a candidate controller against a temporal logic specification while randomizing the control inputs to the system within a bounded set. Second, we add an RL agent that improves the performance of this verified controller by acting within the same bounded set around the control input. Third, we verify that the learned agent satisfies probabilistic safety guarantees with respect to the temporal logic specification. Our approach is efficiently implementable for continuous action and state spaces. Separating safety verification and performance improvement into distinct steps yields both explicit probabilistic safety guarantees and a straightforward RL setup that focuses on performance. We evaluate our approach on an evasion task in which a robot has to reach a goal while evading a dynamic obstacle with a specific maneuver. Our results show that our safe RL approach leads to efficient learning while maintaining its probabilistic safety specification.
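The three-step procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional dynamics, the proportional `base_controller`, the reach-avoid specification, the constants `DELTA`, `HORIZON`, `GOAL`, and `LIMIT`, and the use of Monte Carlo sampling with a one-sided Hoeffding bound as the probabilistic verifier. The "learning" step is a plain random search over a single gain, standing in for an actual RL algorithm.

```python
import math
import random

DELTA = 0.2              # assumed bound of the action set around the base input
HORIZON = 50             # assumed episode length
GOAL, LIMIT = 1.0, 1.5   # hypothetical goal position and safety limit

def base_controller(x):
    """Hypothetical candidate controller: proportional step toward the goal."""
    return 0.3 * (GOAL - x)

def rollout(residual, rng):
    """Simulate x <- x + u_base(x) + r, with r clipped to [-DELTA, DELTA].

    Toy reach-avoid spec: keep |x| <= LIMIT until |x - GOAL| <= 0.05.
    Returns (spec_satisfied, steps_used).
    """
    x = 0.0
    for t in range(1, HORIZON + 1):
        r = max(-DELTA, min(DELTA, residual(x, rng)))
        x = x + base_controller(x) + r
        if abs(x) > LIMIT:
            return False, HORIZON
        if abs(x - GOAL) <= 0.05:
            return True, t
    return False, HORIZON

def verify(residual, n=2000, alpha=0.01, seed=0):
    """Monte Carlo estimate of P(spec) with a one-sided Hoeffding lower bound."""
    rng = random.Random(seed)
    hits = sum(rollout(residual, rng)[0] for _ in range(n))
    p_hat = hits / n
    return p_hat, p_hat - math.sqrt(math.log(1 / alpha) / (2 * n))

def random_residual(x, rng):
    """Step 1: control inputs randomized uniformly within the bounded set."""
    return rng.uniform(-DELTA, DELTA)

def train_residual(candidates=20, seed=1):
    """Step 2 stand-in: random search over a gain k in [0, 1] (not real RL)."""
    rng = random.Random(seed)
    best_k, best_steps = 0.0, rollout(lambda x, _: 0.0, rng)[1]
    for _ in range(candidates):
        k = rng.uniform(0.0, 1.0)
        steps = rollout(lambda x, _, k=k: k * (GOAL - x), rng)[1]
        if steps < best_steps:
            best_k, best_steps = k, steps
    return (lambda x, _: best_k * (GOAL - x)), best_steps

p1, lb1 = verify(random_residual)            # step 1: verify under randomization
learned, learned_steps = train_residual()    # step 2: improve performance
p3, lb3 = verify(learned, n=200)             # step 3: verify the learned agent
base_steps = rollout(lambda x, _: 0.0, random.Random(0))[1]
```

Here `lb1` and `lb3` are lower bounds on the satisfaction probability that hold with confidence `1 - alpha` under the sampling assumptions, and `learned_steps` can be compared with `base_steps` to see the performance gain. Because the toy learned policy and dynamics are deterministic, the third step is degenerate in this sketch; with a stochastic policy or environment the same estimator would apply unchanged.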