We develop a generic policy gradient method with a global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems due to their scalable and efficient nature, adapting these methods to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both the direct and softmax parameterizations, and offer novel insights into the solution of the inner problem via Transition Mirror Ascent (TMA). Additionally, we propose innovative parametric transition kernels for both discrete and continuous state-action spaces, broadening the applicability of our approach. Empirical results validate the robustness and global convergence of DRPMD across various challenging robust MDP settings.
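To make the double-loop structure described above concrete, the following is a minimal tabular sketch in Python; it is our own illustration and not the paper's implementation. The outer loop performs a KL mirror-descent (exponentiated-gradient) policy update against the kernel returned by the inner loop, and the inner loop runs a mirror-ascent-style search for an adversarial transition kernel, with a KL penalty toward the nominal kernel standing in for the paper's ambiguity set. All names (`drpmd`, `tma_inner`, `policy_eval`) and hyperparameters are hypothetical placeholders.

```python
import numpy as np

def policy_eval(P, R, pi, gamma, mu):
    """Exact evaluation of policy pi under kernel P (shape S x A x S)."""
    S, A = R.shape
    P_pi = np.einsum("sa,sax->sx", pi, P)            # induced S x S state kernel
    r_pi = np.einsum("sa,sa->s", pi, R)              # expected reward per state
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sax,x->sa", P, V)
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)  # discounted occupancy
    return V, Q, d_s[:, None] * pi

def tma_inner(P_nom, R, pi, gamma, mu, lam=5.0, steps=100, eta=0.5):
    """Inner loop (sketch): mirror ascent over transition kernels on the
    adversary's KL-regularized objective  -J(P) - lam * KL(P || P_nom)."""
    P = P_nom.copy()
    for _ in range(steps):
        V, _, d = policy_eval(P, R, pi, gamma, mu)
        # grad_{P(x|s,a)} J = gamma * d(s,a) * V(x); the adversary ascends -J.
        grad = -gamma * d[:, :, None] * V[None, None, :]
        grad -= lam * (np.log(P + 1e-12) - np.log(P_nom + 1e-12) + 1.0)
        P = P * np.exp(eta * grad)                   # exponentiated-gradient step
        P /= P.sum(axis=-1, keepdims=True)           # renormalize each row
    return P

def drpmd(P_nom, R, gamma, mu, outer_iters=50, eta_pi=1.0):
    """Outer loop (sketch): KL mirror-descent policy update against the
    approximate worst-case kernel returned by the inner loop."""
    S, A = R.shape
    pi = np.ones((S, A)) / A                         # uniform initial policy
    for _ in range(outer_iters):
        P_worst = tma_inner(P_nom, R, pi, gamma, mu)
        _, Q, _ = policy_eval(P_worst, R, pi, gamma, mu)
        pi = pi * np.exp(eta_pi * Q)                 # mirror-descent policy step
        pi /= pi.sum(axis=-1, keepdims=True)
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 4, 3, 0.9
    P_nom = rng.dirichlet(np.ones(S), size=(S, A))   # nominal kernel, S x A x S
    R = rng.uniform(size=(S, A))
    mu = np.ones(S) / S                              # initial state distribution
    print(np.round(drpmd(P_nom, R, gamma, mu), 3))
```

The adaptive per-iteration tolerance mentioned in the abstract would correspond to loosening or tightening the inner loop's stopping criterion across outer iterations; the fixed `steps` budget here is only a stand-in.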