Safety is essential for the practical deployment of reinforcement learning (RL). Several challenges must be addressed: handling stochasticity in the environment, providing rigorous guarantees of persistent state-wise safety, and avoiding overly conservative behavior that sacrifices performance. We propose a new framework, Reachability Estimation for Safe Policy Optimization (RESPO), for safety-constrained RL in general stochastic settings. Inside the feasible set, where violation-free policies exist, we optimize for reward while maintaining persistent safety. Outside the feasible set, our optimization produces the safest behavior: it guarantees entrance into the feasible set whenever possible, with the least cumulative discounted violations. We introduce a class of algorithms that use our novel reachability estimation function to optimize within the proposed framework and within similar frameworks, such as those that concurrently handle multiple hard and soft constraints. We establish theoretically that our algorithms converge almost surely to locally optimal policies of our safe optimization framework. We evaluate the proposed methods on a diverse suite of safe RL environments from Safety Gym, PyBullet, and MuJoCo, and show improvements in both reward performance and safety over state-of-the-art baselines.