Safe reinforcement learning (RL) is a popular and versatile paradigm for learning reward-maximizing policies with safety guarantees. Prior work tends to express safety constraints in expectation form for ease of implementation, but this proves ineffective at maintaining safety constraints with high probability. To this end, we turn to quantile-constrained RL, which enables a higher level of safety without any expectation-form approximations. We estimate quantile gradients directly through sampling and provide theoretical proofs of convergence. We then implement a tilted update strategy for the quantile gradients to compensate for the asymmetric distributional density, with a direct benefit to return performance. Experiments demonstrate that the proposed model fully satisfies the safety requirements (quantile constraints) while outperforming state-of-the-art benchmarks with higher return.
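The abstract does not spell out the tilted update, so the following is only a minimal sketch of one plausible form: a stochastic-approximation quantile estimator whose upward step is rescaled by an asymmetric factor. The function name `tilted_quantile_step` and the `tilt` parameter are illustrative assumptions, not the paper's method.

```python
import numpy as np

def tilted_quantile_step(q, cost, alpha, lr, tilt=1.0):
    """One stochastic-approximation step toward the alpha-quantile
    of the episode-cost distribution.

    Plain quantile SA:  q <- q + lr * (alpha - 1{cost < q}),
    whose fixed point is the alpha-quantile. The `tilt` factor
    (an assumption for illustration) rescales the upward step to
    compensate for asymmetric density around the quantile; with
    tilt > 1 the fixed point shifts to a more conservative
    (higher) quantile.
    """
    if cost >= q:
        return q + lr * tilt * alpha   # sample above estimate: nudge up
    return q - lr * (1.0 - alpha)      # sample below estimate: nudge down

# Toy usage: track the 0.9-quantile of exponentially distributed costs.
rng = np.random.default_rng(0)
q = 0.0
for _ in range(50_000):
    q = tilted_quantile_step(q, rng.exponential(1.0), alpha=0.9, lr=1e-3)
print(q)  # roughly -ln(0.1) ~= 2.30 when tilt = 1.0
```

With `tilt = 1.0` this reduces to the standard quantile stochastic approximation; setting `tilt > 1` biases the estimate upward, which is one way an asymmetric cost-tail density could be compensated.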