The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and that provides robustness to errors in the transition dynamics model through the use of an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach which has previously been used in the Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG such as not robustifying the full constrained objective and the lack of incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian but learns this directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient to learn the adversary for Adversarial RCPG. Empirical experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.
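To make the "constrained optimisation on a sorted value list" concrete, the following is a minimal sketch of one way the worst-case dynamics could be computed from Lagrangian estimates. It is not the authors' implementation. It assumes an L1 uncertainty set of radius `epsilon` around a nominal next-state distribution, and a per-state Lagrangian estimate of the form `v - lam * c`, where the sign convention is an assumption. The function name `worst_case_transition` and all variable names are hypothetical.

```python
import numpy as np


def worst_case_transition(p_nominal, lagrangian_values, epsilon):
    """Minimise p @ lagrangian_values over the probability simplex,
    subject to ||p - p_nominal||_1 <= epsilon (assumed L1 uncertainty set)."""
    p = np.asarray(p_nominal, dtype=float).copy()
    values = np.asarray(lagrangian_values, dtype=float)

    # Sort next states by their Lagrangian estimate, lowest first.
    order = np.argsort(values)
    worst = order[0]

    # Shift at most epsilon / 2 of probability mass onto the lowest-valued state.
    budget = min(epsilon / 2.0, 1.0 - p[worst])
    p[worst] += budget

    # Take the same amount of mass away, starting from the highest-valued states.
    for s in order[::-1]:
        if budget <= 0.0:
            break
        if s == worst:
            continue
        removed = min(budget, p[s])
        p[s] -= removed
        budget -= removed
    return p


# Illustrative numbers only (not taken from the paper).
p_nominal = np.array([0.5, 0.3, 0.2])   # nominal next-state distribution
v = np.array([1.0, 0.0, 2.0])           # reward value estimates
c = np.array([0.0, 1.0, 0.5])           # constraint-cost value estimates
lam = 0.5                               # Lagrange multiplier
lagrangian = v - lam * c                # assumed form of the Lagrangian estimate

p_worst = worst_case_transition(p_nominal, lagrangian, epsilon=0.2)
print(p_worst, p_worst @ lagrangian)
```

Because the whole budget moves in a single sorted pass, a small change in the estimates can switch which state receives the mass. That is the abrupt behaviour the abstract contrasts with learning the adversary incrementally by gradient descent.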