Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and, through the use of an uncertainty set, provides robustness to errors in the transition dynamics model. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach previously used in Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG, such as not robustifying the full constrained objective and lacking incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian, but learns them directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient used to learn the adversary in Adversarial RCPG. Experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.
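
To make the idea of worst-case dynamics computed from a sorted value list more concrete, the following is a minimal Python sketch and not the paper's implementation. It assumes an L1-ball uncertainty set of radius `alpha` around a nominal next-state distribution, and a Lagrangian of the form `v - lam * c` (reward value minus multiplier times constraint-cost value); the abstract specifies neither, and the function names `lagrangian_values` and `worst_case_l1` are hypothetical.

```python
import numpy as np


def lagrangian_values(v, c, lam):
    """Per-next-state Lagrangian estimate (assumed form: v - lam * c)."""
    return np.asarray(v, dtype=float) - lam * np.asarray(c, dtype=float)


def worst_case_l1(p_nominal, z, alpha):
    """Distribution within L1 distance alpha of p_nominal that minimises p @ z.

    Mass is moved onto the lowest-z next state and taken away from the
    highest-z next states, visiting them in sorted (descending) order.
    """
    p = np.asarray(p_nominal, dtype=float).copy()
    z = np.asarray(z, dtype=float)
    lowest = int(np.argmin(z))
    # An L1 budget of alpha allows moving at most alpha / 2 probability mass.
    eps = min(alpha / 2.0, 1.0 - p[lowest])
    p[lowest] += eps
    for i in np.argsort(-z):  # highest z first
        if eps <= 0.0:
            break
        if i == lowest:
            continue
        take = min(eps, p[i])
        p[i] -= take
        eps -= take
    return p


# Illustrative numbers only (not taken from the paper).
p0 = np.array([0.5, 0.3, 0.2])                      # nominal next-state distribution
z = lagrangian_values([1.0, 0.0, 2.0], [0.0, 1.0, 0.0], lam=0.5)
p_worst = worst_case_l1(p0, z, alpha=0.2)
print(p_worst, p0 @ z, p_worst @ z)
```

Using only `v` or only `c` in place of `z` would correspond to taking the worst case with respect to the value or the constraint alone, whereas passing the Lagrangian robustifies the combined objective. Note that the resulting distribution can change abruptly when the ordering of `z` changes, which is the behaviour the adversarial, gradient-based variant is described as avoiding.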
