Constrained Reinforcement Learning (CRL) tackles sequential decision-making problems where agents are required to achieve goals by maximizing the expected return while meeting domain-specific constraints, which are often formulated as expected costs. In this setting, policy-based methods are widely used since they come with several advantages when dealing with continuous-control problems. These methods search in the policy space with an action-based or parameter-based exploration strategy, depending on whether they directly learn the parameters of a stochastic policy or those of a stochastic hyperpolicy. In this paper, we propose a general framework for addressing CRL problems via gradient-based primal-dual algorithms, relying on an alternate ascent/descent scheme with dual-variable regularization. We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions, improving and generalizing existing results. Then, we design C-PGAE and C-PGPE, the action-based and the parameter-based versions of C-PG, respectively, and we illustrate how they naturally extend to constraints defined in terms of risk measures over the costs, as is often required in safety-critical scenarios. Finally, we numerically validate our algorithms on constrained control problems and compare them with state-of-the-art baselines, demonstrating their effectiveness.
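As a rough illustration of the setting described above (the notation $J$, $J_{c_i}$, $b_i$, $\boldsymbol{\lambda}$, $\omega$ is ours and not necessarily that of the paper), a CRL problem and a dual-regularized Lagrangian for an alternate ascent/descent scheme can be sketched as
\[
\max_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta})
\quad \text{s.t.} \quad
J_{c_i}(\boldsymbol{\theta}) \le b_i, \qquad i = 1, \dots, U,
\]
\[
\mathcal{L}_{\omega}(\boldsymbol{\theta}, \boldsymbol{\lambda})
\;=\;
J(\boldsymbol{\theta})
\;-\; \sum_{i=1}^{U} \lambda_i \big( J_{c_i}(\boldsymbol{\theta}) - b_i \big)
\;+\; \frac{\omega}{2} \, \|\boldsymbol{\lambda}\|_2^2 ,
\]
where the primal variables $\boldsymbol{\theta}$ are updated by gradient ascent and the dual variables $\boldsymbol{\lambda} \ge \boldsymbol{0}$ by projected gradient descent; the quadratic term in $\omega$ regularizes the dual problem, which is the kind of regularization typically invoked to obtain last-iterate convergence guarantees.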