Safe reinforcement learning (RL) aims to solve an optimal control problem under safety constraints. Existing $\textit{direct}$ safe RL methods use the original constraint throughout the learning process. They either lack theoretical guarantees on the policy during iteration or suffer from infeasibility problems. To address these issues, we propose an $\textit{indirect}$ safe RL method called feasible policy iteration (FPI), which iteratively uses the feasible region of the last policy to constrain the current policy. The feasible region is represented by a feasibility function called the constraint decay function (CDF). The core of FPI is a region-wise policy update rule called feasible policy improvement, which maximizes the return under the constraint of the CDF inside the feasible region and minimizes the CDF outside the feasible region. This update rule is always feasible and ensures that the feasible region monotonically expands and that the state-value function monotonically increases inside the feasible region. Using the feasible Bellman equation, we prove that FPI converges to the maximum feasible region and the optimal state-value function. Experiments on classic control tasks and Safety Gym show that our algorithms achieve lower constraint violations and comparable or higher performance than the baselines.
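As a rough illustration of the region-wise update rule described above, the following is a minimal tabular sketch, not the implementation evaluated in the experiments. Everything concrete in it is an assumption made for illustration: the 5-state deterministic chain, the reward, the names `evaluate`, `improve`, and `feasible_policy_iteration`, the tolerance `tol`, and in particular the CDF recursion $F(s) = c(s) + (1 - c(s))\,\gamma\,F(s')$ with feasible region $\{s : F(s) = 0\}$, which the abstract itself does not define.

```python
import numpy as np

GAMMA = 0.9
N_STATES, LEFT, RIGHT = 5, 0, 1

# Hypothetical chain: state 0 violates the constraint, and landing in
# state 1 (right next to the hazard) pays reward 1.
NEXT = np.array([[max(s - 1, 0), min(s + 1, N_STATES - 1)] for s in range(N_STATES)])
VIOLATION = np.array([1.0, 0.0, 0.0, 0.0, 0.0])       # c(s)
REWARD = np.array([0.0, 1.0, 0.0, 0.0, 0.0])[NEXT]    # r(s, a), shape (N_STATES, 2)


def evaluate(policy):
    """Exact policy evaluation of the state value and the (assumed) CDF."""
    idx = np.arange(N_STATES)
    P = np.zeros((N_STATES, N_STATES))
    P[idx, NEXT[idx, policy]] = 1.0
    eye = np.eye(N_STATES)
    # V = r_pi + gamma * P V
    value = np.linalg.solve(eye - GAMMA * P, REWARD[idx, policy])
    # F = c + (1 - c) * gamma * P F   (assumed form of the CDF recursion)
    cdf = np.linalg.solve(eye - GAMMA * (1.0 - VIOLATION)[:, None] * P, VIOLATION)
    return value, cdf


def improve(value, cdf, tol=1e-8):
    """Region-wise update: constrained return maximization inside the
    feasible region of the last policy, CDF minimization outside it."""
    feasible = cdf <= tol
    q_value = REWARD + GAMMA * value[NEXT]
    q_cdf = VIOLATION[:, None] + (1.0 - VIOLATION)[:, None] * GAMMA * cdf[NEXT]
    safe_actions = q_cdf <= tol
    # Inside: best action among those that keep the action-CDF at zero.
    inside = np.where(safe_actions, q_value, -np.inf).argmax(axis=1)
    # Outside: action with the smallest action-CDF.
    outside = q_cdf.argmin(axis=1)
    return np.where(feasible, inside, outside), feasible


def feasible_policy_iteration(policy, max_iters=20):
    history = []  # feasible region of each evaluated policy
    for _ in range(max_iters):
        value, cdf = evaluate(policy)
        new_policy, feasible = improve(value, cdf)
        history.append(feasible)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, history


# Start from an unsafe policy that always moves toward the hazard.
policy, history = feasible_policy_iteration(np.full(N_STATES, LEFT))
print("policy:", policy)
print("feasible region sizes:", [int(f.sum()) for f in history])
```

On this toy chain the initial policy has an empty feasible region, so the first update only minimizes the CDF. Later updates keep the feasible region (states 1 to 4) and improve the return inside it by moving back and forth next to the hazard without entering it. The old policy's action is always admissible inside the feasible region, which is why the constrained maximization never runs out of candidates in this sketch.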