In this paper, we consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL). Specifically, a safe policy or controller is one that, with high probability, keeps the agent's trajectory within a given safe set. We establish a connection between this probabilistic-constrained setting and the cumulative-constrained formulation that is frequently explored in the existing literature. We provide theoretical bounds showing that the probabilistic-constrained setting offers a better trade-off between optimality and safety (constraint satisfaction). The main challenge in dealing with probabilistic constraints is the absence of explicit expressions for their gradients. Our prior work provides such an explicit gradient expression for probabilistic constraints, which we term Safe Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved gradient, SPG-Actor-Critic, that has lower variance than SPG-REINFORCE, as substantiated by our theoretical results. A noteworthy aspect of both SPGs is that they are algorithm-independent, so they can be applied across a range of policy-based algorithms. Furthermore, we propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies. We then provide theoretical analyses that cover the convergence of the algorithm, as well as its near-optimality and feasibility on average. In addition, we test the proposed approaches in a series of empirical experiments. These experiments examine the inherent trade-offs between optimality and safety, and substantiate the efficacy of the two SPGs as well as our theoretical contributions.
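To make the two constraint types concrete, the following is a minimal sketch of how a probabilistic constraint and a cumulative constraint might be written, together with an elementary link between them and a generic primal-dual update. All notation here is assumed for illustration and is not taken from the abstract, and the inequalities are standard union-bound arguments, not necessarily the bounds derived in the paper.

```latex
% Assumed notation (hypothetical, for illustration only): horizon T,
% safe set \mathcal{S}_{\text{safe}}, tolerance \delta \in (0,1),
% policy \pi_\theta, expected return V(\theta), dual variable \lambda \ge 0,
% step sizes \eta_\theta, \eta_\lambda > 0.

% Probabilistic constraint: the whole trajectory stays safe with high probability.
\begin{equation*}
\max_{\theta} \; V(\theta)
\quad \text{s.t.} \quad
\mathbb{P}\Big( \bigcap_{t=0}^{T} \{ S_t \in \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_\theta \Big) \ge 1 - \delta .
\end{equation*}

% Cumulative constraint: the expected number of unsafe steps is bounded.
\begin{equation*}
\mathbb{E}\Big[ \sum_{t=0}^{T} \mathbf{1}\{ S_t \notin \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_\theta \Big] \le \delta' .
\end{equation*}

% Elementary link. The upper bound is the union bound; the lower bound holds
% because at most T+1 steps can be unsafe on any unsafe trajectory.
\begin{equation*}
\frac{1}{T+1}\,
\mathbb{E}\Big[ \sum_{t=0}^{T} \mathbf{1}\{ S_t \notin \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_\theta \Big]
\;\le\;
\mathbb{P}\Big( \bigcup_{t=0}^{T} \{ S_t \notin \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_\theta \Big)
\;\le\;
\mathbb{E}\Big[ \sum_{t=0}^{T} \mathbf{1}\{ S_t \notin \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_\theta \Big] .
\end{equation*}

% Generic primal-dual iteration on the Lagrangian of the probabilistic problem.
% The gradient of the probability term is the quantity an SPG would estimate.
\begin{align*}
\mathcal{L}(\theta, \lambda) &= V(\theta) + \lambda \Big( \mathbb{P}\Big( \bigcap_{t=0}^{T} \{ S_t \in \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_\theta \Big) - (1 - \delta) \Big), \\
\theta_{k+1} &= \theta_k + \eta_\theta \, \nabla_\theta \mathcal{L}(\theta_k, \lambda_k), \\
\lambda_{k+1} &= \Big[ \lambda_k - \eta_\lambda \Big( \mathbb{P}\Big( \bigcap_{t=0}^{T} \{ S_t \in \mathcal{S}_{\text{safe}} \} \,\Big|\, \pi_{\theta_{k+1}} \Big) - (1 - \delta) \Big) \Big]_{+} .
\end{align*}
```

Under this notation, a cumulative constraint with \delta' = \delta is sufficient for the probabilistic constraint at level 1 - \delta, while the probabilistic constraint only guarantees the cumulative one with \delta' = (T+1)\delta. This gap suggests why enforcing safety through a cumulative surrogate can be conservative.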