Primal-dual safe RL methods commonly alternate between the primal update of the policy and the dual update of the Lagrange multiplier. This training paradigm is highly susceptible to error in cumulative cost estimation, since that estimate is the key link between the primal and dual update processes. We show that this problem causes significant underestimation of cost when off-policy methods are used, leading to failure to satisfy the safety constraint. To address this issue, we propose \textit{conservative policy optimization}, which learns a policy in a constraint-satisfying region by accounting for the uncertainty in cost estimation. This improves constraint satisfaction but can also hinder reward maximization. We then introduce \textit{local policy convexification} to help eliminate such suboptimality by gradually reducing the estimation uncertainty. We provide theoretical interpretations of the joint coupling effect of these two ingredients and further verify them through extensive experiments. Results on benchmark tasks show that our method not only achieves asymptotic performance comparable to state-of-the-art on-policy methods while using far fewer samples, but also significantly reduces constraint violation during training. Our code is available at https://github.com/ZifanWu/CAL.
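To make the role of the cost estimate in the dual update concrete, the following is a minimal sketch and not the authors' implementation. It assumes that uncertainty is measured as the standard deviation across a small ensemble of cumulative-cost estimates and that the conservative estimate is the ensemble mean plus `k` times that deviation; the function names, the ensemble values, `k`, `lr`, and `cost_limit` are all hypothetical choices made for illustration.

```python
import statistics


def conservative_cost(estimates, k=0.5):
    """Pessimistic cost estimate: ensemble mean plus k times the ensemble std.

    Assumption: uncertainty is approximated by the population standard
    deviation of the ensemble; the paper may quantify it differently.
    """
    mean = statistics.fmean(estimates)
    std = statistics.pstdev(estimates)
    return mean + k * std


def dual_update(lmbda, cost_estimate, cost_limit, lr=0.1):
    """One projected gradient-ascent step on the Lagrange multiplier.

    The multiplier grows when the estimated cost exceeds the limit and is
    clipped at zero otherwise.
    """
    return max(0.0, lmbda + lr * (cost_estimate - cost_limit))


# Hypothetical ensemble of cumulative-cost estimates for the current policy.
ensemble = [20.0, 24.0, 28.0]
cost_limit = 25.0

plain = statistics.fmean(ensemble)         # point estimate ignores uncertainty
cons = conservative_cost(ensemble, k=0.5)  # estimate inflated by uncertainty

lam_plain = dual_update(0.0, plain, cost_limit)
lam_cons = dual_update(0.0, cons, cost_limit)

print(f"plain estimate: {plain:.3f}, multiplier: {lam_plain:.4f}")
print(f"conservative estimate: {cons:.3f}, multiplier: {lam_cons:.4f}")
```

In this toy setting the point estimate falls below the limit, so the multiplier stays at zero and the primal step would see no penalty, whereas the conservative estimate exceeds the limit and pushes the multiplier up. As the ensemble members agree more closely, the two estimates converge, which is the intuition behind reducing conservativeness as estimation uncertainty shrinks.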