We present a novel robust policy gradient (RPG) method for s-rectangular robust Markov Decision Processes (MDPs). We are the first to derive the adversarial kernel in closed form and to show that it is a rank-one perturbation of the nominal kernel. This allows us to derive an RPG that resembles the policy gradient used in non-robust MDPs, except that it uses a robust Q-value function and an additional correction term. Both the robust Q-values and the correction term are efficiently computable, so the time complexity of our method matches that of policy gradient methods for non-robust MDPs, which makes it significantly faster than existing black-box methods.
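To make the phrase "rank-one perturbation of the nominal kernel" concrete, the following is a minimal NumPy sketch of what such a perturbation looks like in general. It is not the closed-form adversarial kernel from the paper: the vectors `u` and `v`, the reward `r`, the discount `gamma`, and the state-by-state (policy-induced) view of the kernel are all assumptions chosen for illustration. The sketch checks that the perturbed matrix is still a valid transition matrix, and shows one generic reason rank-one structure can be convenient: by the Sherman-Morrison identity, the value function under the perturbed kernel can be obtained from linear solves against the nominal kernel only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 4
gamma = 0.9  # hypothetical discount factor

# Hypothetical nominal policy-induced kernel: rows are probability vectors.
# Entries are bounded away from zero so a small perturbation stays nonnegative.
P0 = 1.0 + rng.random((n_states, n_states))
P0 /= P0.sum(axis=1, keepdims=True)

# Illustrative perturbation vectors (assumptions, not the paper's formula).
v = rng.standard_normal(n_states)
v -= v.mean()                    # v sums to zero, so every row still sums to 1
v /= np.linalg.norm(v)
u = 0.01 * rng.random(n_states)  # small magnitude keeps entries nonnegative

# Rank-one perturbation of the nominal kernel.
P = P0 - np.outer(u, v)

r = rng.random(n_states)         # hypothetical reward vector


def value_direct(P, r, gamma):
    """Solve (I - gamma * P) V = r using the perturbed kernel directly."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)


def value_sherman_morrison(P0, u, v, r, gamma):
    """Same value function, using only solves against the nominal kernel.

    I - gamma * P = A + w z^T with A = I - gamma * P0, w = gamma * u, z = v,
    so the Sherman-Morrison identity gives the inverse applied to r.
    """
    A = np.eye(len(r)) - gamma * P0
    V0 = np.linalg.solve(A, r)             # nominal value function
    Ainv_w = np.linalg.solve(A, gamma * u)
    return V0 - Ainv_w * (v @ V0) / (1.0 + v @ Ainv_w)


V_direct = value_direct(P, r, gamma)
V_sm = value_sherman_morrison(P0, u, v, r, gamma)
print("max abs difference:", np.max(np.abs(V_direct - V_sm)))
```

The two value computations agree up to floating-point error. Whether the paper's correction term takes this form is not stated in the abstract, so the sketch should be read only as an illustration of rank-one structure.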