We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets, thereby addressing an open challenge in the robust MDP literature. Indeed, uncertainty sets that display statistical optimality properties and make optimal use of limited data often fail to be rectangular. Unfortunately, the corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable. We first present a randomized projected Langevin dynamics algorithm that solves the robust policy evaluation problem to global optimality but is inefficient. We also propose a deterministic policy gradient method that is efficient but solves the robust policy evaluation problem only approximately, and we prove that the approximation error scales with a new measure of non-rectangularity of the uncertainty set. Finally, we describe an actor-critic algorithm that finds an $\epsilon$-optimal solution for the robust policy improvement problem in $\mathcal{O}(1/\epsilon^4)$ iterations. We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees. Numerical experiments show that our algorithms compare favorably against state-of-the-art methods.
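To make the robust policy evaluation problem concrete, the following minimal sketch runs a generic projected Langevin iteration on a toy two-state chain. It is an illustration under stated assumptions, not the algorithm analyzed in the paper. The chain, the rewards, the discount factor, the ball-shaped uncertainty set (which couples the transition parameters of both states and is therefore not rectangular), the finite-difference gradient, and all step-size and temperature values are hypothetical choices made for this example only.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
GAMMA = 0.9            # discount factor
REWARD = (1.0, 0.0)    # reward collected in state 0 and state 1
CENTER = (0.5, 0.5)    # nominal parameters (p, q) of the transition kernel
RADIUS = 0.3           # radius of the uncertainty ball around CENTER


def value_at_start(xi):
    """Discounted value of state 0 under P = [[1-p, p], [q, 1-q]].

    Solves the 2x2 linear system (I - GAMMA * P) v = REWARD by Cramer's rule.
    """
    p, q = xi
    a11, a12 = 1.0 - GAMMA * (1.0 - p), -GAMMA * p
    a21, a22 = -GAMMA * q, 1.0 - GAMMA * (1.0 - q)
    det = a11 * a22 - a12 * a21
    return (REWARD[0] * a22 - a12 * REWARD[1]) / det


def project(xi):
    """Euclidean projection onto the ball of radius RADIUS around CENTER.

    The ball ties p (state 0) to q (state 1), so the set is not a product
    of per-state sets, i.e. it is not rectangular.
    """
    dx, dy = xi[0] - CENTER[0], xi[1] - CENTER[1]
    norm = math.hypot(dx, dy)
    if norm <= RADIUS:
        return xi
    scale = RADIUS / norm
    return (CENTER[0] + scale * dx, CENTER[1] + scale * dy)


def grad(f, xi, h=1e-6):
    """Central finite-difference gradient of f at xi."""
    p, q = xi
    return (
        (f((p + h, q)) - f((p - h, q))) / (2.0 * h),
        (f((p, q + h)) - f((p, q - h))) / (2.0 * h),
    )


def projected_langevin(steps=500, eta=0.01, beta=1000.0, seed=0):
    """Minimize value_at_start over the ball with noisy projected gradient steps.

    Update: xi <- project(xi - eta * grad + sqrt(2 * eta / beta) * gaussian).
    Returns the best (lowest-value) iterate seen and its value.
    """
    rng = random.Random(seed)
    xi = CENTER
    best_xi, best_val = xi, value_at_start(xi)
    noise = math.sqrt(2.0 * eta / beta)
    for _ in range(steps):
        g = grad(value_at_start, xi)
        xi = project((
            xi[0] - eta * g[0] + noise * rng.gauss(0.0, 1.0),
            xi[1] - eta * g[1] + noise * rng.gauss(0.0, 1.0),
        ))
        val = value_at_start(xi)
        if val < best_val:
            best_xi, best_val = xi, val
    return best_xi, best_val
```

Calling `projected_langevin()` returns a kernel parameter pair inside the ball together with the lowest value found, which can be compared against the nominal value `value_at_start(CENTER)`. The injected Gaussian noise is what distinguishes this iteration from plain projected gradient descent; in this toy problem the objective is simple enough that the noise plays little role, whereas in non-convex instances it is the mechanism that allows escape from local minima.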