In this paper, we consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL). Specifically, a safe policy or controller is one that, with high probability, keeps the agent's trajectory within a given safe set. We establish a connection between this probabilistic-constrained setting and the cumulative-constrained formulation frequently explored in the existing literature, and we provide theoretical bounds showing that the probabilistic-constrained setting offers a better trade-off between optimality and safety (constraint satisfaction). The main challenge in handling probabilistic constraints is the absence of explicit expressions for their gradients. Our prior work provides such an explicit gradient expression, which we term Safe Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved gradient, SPG-Actor-Critic, which has lower variance than SPG-REINFORCE, as substantiated by our theoretical results. Both SPGs are algorithm-independent, so they can be applied across a range of policy-based algorithms. Furthermore, we propose a Safe Primal-Dual algorithm that can leverage both SPGs to learn safe policies, and we analyze its convergence as well as its near-optimality and feasibility on average. Finally, we test the proposed approaches in a series of empirical experiments that examine the trade-off between optimality and safety and substantiate the efficacy of the two SPGs as well as our theoretical contributions.
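To make the distinction between the two constraint types concrete, the following is a minimal sketch and not the paper's implementation. It assumes scalar states, a hypothetical safe set |s| <= 1, and a hypothetical tolerance `delta`; all function names are illustrative. The probabilistic measure counts a trajectory as safe only if every state is safe, while the cumulative-style measure averages per-step safety. The dual step shown is a generic projected update on a Lagrange multiplier, given here as an assumption about how a primal-dual scheme might react to a constraint violation.

```python
from typing import Callable, Sequence

State = float  # assumption: scalar states, for illustration only


def trajectory_is_safe(states: Sequence[State],
                       is_safe: Callable[[State], bool]) -> bool:
    """A trajectory is safe only if every visited state lies in the safe set."""
    return all(is_safe(s) for s in states)


def estimate_safety_probability(trajectories: Sequence[Sequence[State]],
                                is_safe: Callable[[State], bool]) -> float:
    """Monte Carlo estimate of P(all states of a trajectory are safe)."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    safe_count = sum(trajectory_is_safe(t, is_safe) for t in trajectories)
    return safe_count / len(trajectories)


def estimate_cumulative_safety(trajectories: Sequence[Sequence[State]],
                               is_safe: Callable[[State], bool]) -> float:
    """Cumulative-style measure: average fraction of safe steps per trajectory."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    fractions = [sum(is_safe(s) for s in t) / len(t) for t in trajectories]
    return sum(fractions) / len(fractions)


def dual_update(lam: float, safety_prob: float,
                delta: float, step_size: float) -> float:
    """Generic projected dual step (an assumption, not the paper's exact rule).

    The multiplier grows when the estimated safety probability falls
    below the target 1 - delta, and is clipped at zero otherwise.
    """
    violation = safety_prob - (1.0 - delta)
    return max(0.0, lam - step_size * violation)


if __name__ == "__main__":
    in_unit_interval = lambda s: abs(s) <= 1.0  # hypothetical safe set
    sampled = [[0.0, 0.5], [0.2, 1.5], [0.1, -0.3], [0.0, 0.9]]
    p_safe = estimate_safety_probability(sampled, in_unit_interval)
    c_safe = estimate_cumulative_safety(sampled, in_unit_interval)
    print(p_safe, c_safe, dual_update(0.0, p_safe, delta=0.1, step_size=1.0))
```

In this toy data, the second trajectory leaves the safe set for one of its two steps. It counts as fully unsafe under the probabilistic measure but as half safe under the cumulative-style measure, which illustrates why the two formulations can rank the same policy differently.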