Primal-dual methods have a natural application in Safe Reinforcement Learning (SRL), posed as a constrained policy optimization problem. In practice, however, applying primal-dual methods to SRL is challenging because the learning rate (LR) and the Lagrangian multipliers (dual variables) are interdependent each time an embedded unconstrained RL problem is solved. In this paper, we propose, analyze, and evaluate adaptive primal-dual (APD) methods for SRL, in which two adaptive LRs are adjusted to the Lagrangian multipliers so as to optimize the policy in each iteration. We theoretically establish the convergence, optimality, and feasibility of the APD algorithm. Finally, we numerically evaluate the practical APD algorithm on four well-known environments in Bullet-Safety-Gym, employing two state-of-the-art SRL algorithms: PPO-Lagrangian and DDPG-Lagrangian. All experiments show that the practical APD algorithm outperforms, or performs comparably to, the constant-LR cases while training more stably. Additionally, we substantiate the robustness of the selection of the two adaptive LRs with empirical evidence.
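To make the primal-dual setting concrete, the following is a minimal sketch on a toy problem, not the APD algorithm itself. It maximizes the reward -(x - 2)^2 subject to the cost constraint x <= 1, whose analytic solution is x* = 1 with multiplier λ* = 2. The function name `primal_dual`, its parameters, and in particular the scaling rule `base_primal_lr / (1 + lam)` are assumptions introduced only for illustration; the abstract does not state the actual adaptive LR rules, and only the primal LR is adapted here.

```python
def primal_dual(base_primal_lr=0.05, dual_lr=0.05, steps=5000, adaptive=False):
    """Toy primal-dual iteration for: maximize -(x - 2)^2 subject to x <= 1.

    Lagrangian: L(x, lam) = -(x - 2)^2 - lam * (x - 1), with lam >= 0.
    Returns the final primal variable x and multiplier lam.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the Lagrangian with respect to the primal variable.
        grad_x = -2.0 * (x - 2.0) - lam
        # Constraint violation, evaluated at the current x.
        violation = x - 1.0

        # ASSUMPTION: a hypothetical multiplier-dependent primal LR, used only
        # to illustrate the idea of tying the LR to the dual variable.
        # This is not the rule proposed in the paper.
        primal_lr = base_primal_lr / (1.0 + lam) if adaptive else base_primal_lr

        # Primal ascent step on the Lagrangian.
        x = x + primal_lr * grad_x
        # Dual ascent step, projected onto lam >= 0.
        lam = max(0.0, lam + dual_lr * violation)
    return x, lam


if __name__ == "__main__":
    print("constant LR:", primal_dual(adaptive=False))
    print("scaled LR:  ", primal_dual(adaptive=True))
```

Both variants should approach the constrained optimum on this toy problem. The point of the sketch is only that the effective gradient of the Lagrangian grows with the multiplier, which is one reason a fixed primal LR can interact poorly with the dual variable.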