Safe reinforcement learning (RL) is crucial for deploying RL agents in real-world applications, as it aims to maximize long-term rewards while satisfying safety constraints. However, safe RL often suffers from sample inefficiency, requiring extensive interaction with the environment to learn a safe policy. We propose Efficient Safe Policy Optimization (ESPO), a novel approach that improves the efficiency of safe RL through sample manipulation. ESPO employs an optimization framework with three modes: maximizing rewards, minimizing costs, and balancing the trade-off between the two. By dynamically adjusting the sampling process based on the observed conflict between the reward and safety gradients, ESPO comes with theoretical guarantees of convergence and optimization stability, together with improved sample complexity bounds. Experiments on the Safety-MuJoCo and OmniSafe benchmarks demonstrate that ESPO significantly outperforms existing primal-based and primal-dual-based baselines in both reward maximization and constraint satisfaction. Moreover, ESPO achieves substantial gains in sample efficiency, requiring 25--29% fewer samples than the baselines and reducing training time by 21--38%.
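As a rough illustration of the three-mode framework and the conflict-driven sampling described in the abstract, the sketch below shows one way such a rule could be written. It is not the ESPO algorithm itself: the function names, the `slack` band around the cost limit, the equal weighting in the balance mode, and the `up`/`down` sample-size factors are all assumptions made for illustration.

```python
import numpy as np


def choose_mode(cost, limit, slack=0.1):
    """Pick one of three modes from the estimated cost.

    The slack band around the limit is a hypothetical choice.
    """
    if cost < limit - slack:
        return "maximize_reward"
    if cost > limit + slack:
        return "minimize_cost"
    return "balance"


def gradients_conflict(g_reward, g_cost):
    """Return True when ascending the reward opposes descending the cost.

    The reward step follows g_reward and the cost step follows -g_cost,
    so the two conflict when their inner product is negative.
    """
    return float(np.dot(g_reward, -np.asarray(g_cost))) < 0.0


def next_sample_size(base, g_reward, g_cost, up=0.2, down=0.2):
    """Scale the batch size: more samples on conflict, fewer otherwise.

    The scaling factors are assumed values, not taken from the paper.
    """
    factor = 1.0 + up if gradients_conflict(g_reward, g_cost) else 1.0 - down
    return max(1, int(round(base * factor)))


def update_direction(mode, g_reward, g_cost):
    """Return the policy update direction for the chosen mode."""
    g_reward = np.asarray(g_reward, dtype=float)
    g_cost = np.asarray(g_cost, dtype=float)
    if mode == "maximize_reward":
        return g_reward
    if mode == "minimize_cost":
        return -g_cost
    # Balance mode: equal weighting is an assumption of this sketch.
    return 0.5 * (g_reward - g_cost)
```

In a training loop, one would estimate `g_reward`, `g_cost`, and the current cost from the latest batch, call `choose_mode` and `update_direction` to take the policy step, and use `next_sample_size` to decide how many transitions to collect in the next iteration.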