Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and that provides robustness to errors in the transition dynamics model through the use of an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach which has previously been used in the Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG such as not robustifying the full constrained objective and the lack of incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian but learns this directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient to learn the adversary for Adversarial RCPG. Empirical experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.
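The difference between taking the worst-case dynamics on the value and taking it on the Lagrangian can be illustrated with a minimal sketch. The abstract does not specify the uncertainty set or the exact form of the Lagrangian, so everything below is an assumption made for illustration: the uncertainty set is taken to be an L1 ball of radius `budget` around a nominal next-state distribution, the Lagrangian per next state is taken to be the reward value minus a multiplier times the constraint-cost value, and the function names `worst_case_l1` and `lagrangian_values` are hypothetical. The greedy routine is the standard sorted-value solution for an L1 ball, not necessarily the procedure used in the paper.

```python
import numpy as np

def worst_case_l1(nominal, values, budget):
    """Minimise p @ values over distributions p with ||p - nominal||_1 <= budget.

    Greedy solution for an L1 ball (an assumed uncertainty set): move
    probability mass onto the lowest-valued next state and take it away
    from the highest-valued next states first.
    """
    p = np.asarray(nominal, dtype=float).copy()
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)      # ascending: worst next state first
    worst = order[0]
    # Moving a mass m changes the L1 distance by 2m, so at most budget / 2 moves.
    shift = min(budget / 2.0, 1.0 - p[worst])
    p[worst] += shift
    for s in order[::-1]:           # highest-valued next states first
        if shift <= 0.0:
            break
        if s == worst:
            continue
        taken = min(p[s], shift)
        p[s] -= taken
        shift -= taken
    return p

def lagrangian_values(reward_values, cost_values, multiplier):
    """Assumed per-state Lagrangian: reward value minus weighted constraint cost.

    A constant constraint budget would not change which state is worst,
    so it is omitted here.
    """
    return reward_values - multiplier * cost_values

nominal = np.array([0.5, 0.3, 0.2])
reward_values = np.array([1.0, 0.0, 2.0])
cost_values = np.array([0.0, 0.0, 3.0])

# Worst case with respect to the reward value only.
p_value = worst_case_l1(nominal, reward_values, budget=0.2)

# Worst case with respect to the Lagrangian, which also penalises constraint cost.
lag = lagrangian_values(reward_values, cost_values, multiplier=1.0)
p_lagrangian = worst_case_l1(nominal, lag, budget=0.2)
```

In this toy example the value-based adversary shifts mass towards the state with the lowest reward value, whereas the Lagrangian-based adversary shifts mass towards the state with the high constraint cost, so the two notions of "worst case" select different dynamics. An adversarial variant would replace the abrupt greedy step with a parameterised adversary updated incrementally by gradient steps; that is not shown here.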
