Safe reinforcement learning (Safe RL) refers to a class of techniques that aim to prevent RL algorithms from violating constraints during trial-and-error decision-making and exploration. This paper introduces a novel model-free Safe RL algorithm, Safety Optimized RL (SORL), formulated within the multi-objective policy optimization framework, in which the policy is optimized for optimality and safety simultaneously. Optimality is pursued through the environment reward function, which is subsequently shaped using a safety critic. The advantage of SORL over traditional Safe RL algorithms is that it removes the need to constrain the policy search space. This allows SORL to find a natural tradeoff between safety and optimality, without the loss in either that strict search space constraints can cause. Through our theoretical analysis of SORL, we propose a condition under which SORL's converged policy guarantees safety, and we use it to introduce an aggressiveness parameter that allows this tradeoff to be fine-tuned. Experimental results in seven different robotic environments show a considerable reduction in the number of safety violations, along with higher or competitive policy returns, compared with six state-of-the-art Safe RL methods. These results demonstrate the significant superiority of the proposed SORL algorithm in safety-critical applications.
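To make the idea of shaping the environment reward with a safety critic more concrete, the following Python sketch shows one possible form such shaping could take. It is an illustration under stated assumptions, not the SORL update rule: the abstract does not give the shaping formula, so the additive penalty, the function name `shape_reward`, the critic output `unsafe_prob`, and the use of `1 / aggressiveness` as the penalty weight are all hypothetical choices made here for clarity.

```python
def shape_reward(reward: float, unsafe_prob: float, aggressiveness: float = 1.0) -> float:
    """Hypothetical reward shaping with a safety critic (not the paper's formula).

    reward         -- reward returned by the environment
    unsafe_prob    -- assumed safety-critic estimate in [0, 1] of the chance
                      that the current state-action pair leads to a violation
    aggressiveness -- assumed positive knob; larger values weaken the safety
                      penalty and so favour return over safety
    """
    if not 0.0 <= unsafe_prob <= 1.0:
        raise ValueError("unsafe_prob must lie in [0, 1]")
    if aggressiveness <= 0.0:
        raise ValueError("aggressiveness must be positive")

    # Assumed additive form: subtract a penalty proportional to the critic's
    # estimate, scaled down as the agent is allowed to be more aggressive.
    penalty = unsafe_prob / aggressiveness
    return reward - penalty


if __name__ == "__main__":
    # Same transition, three settings of the hypothetical aggressiveness knob.
    for a in (0.5, 1.0, 2.0):
        print(a, shape_reward(reward=1.0, unsafe_prob=0.5, aggressiveness=a))
```

In this sketch the policy is still trained on a single scalar signal and the search space is left unconstrained, which mirrors the abstract's description at a high level; how SORL actually combines the two objectives, and how its aggressiveness parameter enters, is specified in the paper body rather than here.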