Towards a Practical Understanding of Lagrangian Methods in Safe Reinforcement Learning

Safe reinforcement learning addresses constrained optimization problems in which performance must be maximized subject to safety constraints, and Lagrangian methods are a widely used approach for this purpose. However, the effectiveness of Lagrangian methods depends crucially on the choice of the Lagrange multiplier $\lambda$, which governs the multi-objective trade-off between return and cost. A common practice is to update the multiplier automatically during training. Although this approach is standard, there is limited empirical evidence on the best achievable trade-off between return and cost as a function of $\lambda$, and there is currently no systematic benchmark comparing automated update mechanisms to this empirical optimum. We therefore study (i) the constraint geometry of eight widely used safety tasks and (ii) the previously overlooked constraint-regime sensitivity of different Lagrange multiplier update mechanisms in safe reinforcement learning. Through the lens of multi-objective analysis, we present empirical Pareto frontiers that offer a complete visualization of the trade-off between return and cost in the underlying optimization problem. Our results reveal how sensitive this trade-off is to $\lambda$ and further show that the restrictiveness of the cost constraint can vary across different cost limits within the same task. This highlights the importance of selecting cost limits carefully, covering different regions of cost restrictiveness, when evaluating safe reinforcement learning methods. We provide a recommended set of cost limits for each evaluated task and offer an open-source code base: https://github.com/lindsayspoor/Lagrangian_SafeRL.
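
To make the role of the multiplier concrete, the following is a minimal sketch, not the code from the repository above. It assumes the textbook Lagrangian relaxation, in which the policy maximizes return minus $\lambda$ times the constraint violation (expected cost minus the cost limit), together with plain projected gradient ascent on $\lambda$ as one possible automated update. The function names (`lagrangian`, `update_multiplier`, `pareto_front`), the learning rate, and the example numbers are all hypothetical and chosen only for illustration. The update mechanisms actually benchmarked in the paper may differ.

```python
from typing import List, Tuple


def lagrangian(ret: float, cost: float, lam: float, cost_limit: float) -> float:
    """Scalarized objective: return minus lam times the constraint violation."""
    return ret - lam * (cost - cost_limit)


def update_multiplier(lam: float, cost: float, cost_limit: float,
                      lr: float = 0.05) -> float:
    """One projected gradient-ascent step on lam (assumed update rule).

    lam grows while the measured cost exceeds the limit, shrinks otherwise,
    and is clipped at zero so it remains a valid multiplier.
    """
    return max(0.0, lam + lr * (cost - cost_limit))


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (cost, return) pairs, sorted by cost.

    A pair is dominated if another pair has cost no higher and return no
    lower, with at least one of the two strictly better.
    """
    front: List[Tuple[float, float]] = []
    best_return = float("-inf")
    # Sort by cost ascending; among equal costs, highest return comes first.
    for cost, ret in sorted(points, key=lambda p: (p[0], -p[1])):
        if ret > best_return:  # strictly better return than any cheaper point
            front.append((cost, ret))
            best_return = ret
    return front


if __name__ == "__main__":
    # Hypothetical episode costs against a cost limit of 25.
    lam = 0.0
    for episode_cost in [40.0, 35.0, 20.0, 30.0]:
        lam = update_multiplier(lam, episode_cost, cost_limit=25.0, lr=0.1)
        print(f"cost={episode_cost:5.1f}  lambda={lam:.2f}")

    # Hypothetical (cost, return) outcomes from training with fixed lambdas.
    runs = [(10.0, 5.0), (20.0, 4.0), (20.0, 8.0), (30.0, 8.0), (5.0, 1.0)]
    print(pareto_front(runs))
```

In this sketch, an empirical Pareto frontier is obtained by training once per fixed value of $\lambda$, recording each run's final (cost, return) pair, and passing the collection to `pareto_front`. An automated update rule can then be judged by how close its final (cost, return) pair lies to that frontier at a given cost limit.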
