Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in the transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form a basic and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial-time and strongly polynomial-time algorithms is a central question for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for an arbitrary discount factor, and the seminal work of Ye established strongly polynomial-time algorithms when the discount factor is fixed. Whether such results generalize to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a constant (fixed) discount factor, resolving this open question.
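For concreteness, the objects above can be made explicit. In an $(s, a)$-rectangular $L_\infty$ RMDP, each state-action pair carries its own uncertainty set $\mathcal{P}_{s,a} = \{\, p \in \Delta(S) : \|p - \hat{p}_{s,a}\|_\infty \le \beta_{s,a} \,\}$ around a nominal kernel $\hat{p}_{s,a}$, and the robust optimal value satisfies the robust Bellman equation $v^*(s) = \max_{a} \min_{p \in \mathcal{P}_{s,a}} \big( r(s,a) + \gamma \, p^\top v^* \big)$. The sketch below is a minimal illustration of robust policy iteration for this model, not the paper's analyzed algorithm or its strongly polynomial bound; the instance names (`P_hat`, `R`, `beta`, `gamma`) are hypothetical, and the inner worst-case expectation over the $L_\infty$ ball is solved here as a small linear program for clarity.

```python
# Illustrative sketch of robust policy iteration for an (s, a)-rectangular
# L_inf RMDP with discounted payoffs. Assumed (hypothetical) inputs:
#   P_hat: (S, A, S) nominal transition kernels
#   R:     (S, A) rewards
#   beta:  L_inf radius of each uncertainty set (uniform here for simplicity)
#   gamma: discount factor in (0, 1)
import numpy as np
from scipy.optimize import linprog

def worst_case_value(p_hat, v, beta):
    """min_p p.v over {p in simplex : ||p - p_hat||_inf <= beta}, as an LP."""
    n = len(v)
    bounds = [(max(0.0, p_hat[i] - beta), min(1.0, p_hat[i] + beta))
              for i in range(n)]
    res = linprog(c=v, A_eq=np.ones((1, n)), b_eq=[1.0], bounds=bounds)
    return res.fun

def robust_policy_iteration(P_hat, R, beta, gamma, tol=1e-8):
    S, A, _ = P_hat.shape
    pi = np.zeros(S, dtype=int)  # start from an arbitrary deterministic policy
    while True:
        # Robust policy evaluation: fixed point of the (gamma-contractive)
        # robust Bellman operator for the current policy pi.
        v = np.zeros(S)
        while True:
            v_new = np.array([R[s, pi[s]]
                              + gamma * worst_case_value(P_hat[s, pi[s]], v, beta)
                              for s in range(S)])
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        # Policy improvement: greedy against the worst-case kernel in each set.
        q = np.array([[R[s, a] + gamma * worst_case_value(P_hat[s, a], v, beta)
                       for a in range(A)] for s in range(S)])
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new

# Toy instance (purely illustrative numbers): 2 states, 2 actions.
P_hat = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi, v = robust_policy_iteration(P_hat, R, beta=0.05, gamma=0.9)
```

Since there are finitely many deterministic policies and each improvement step is greedy against the worst case, the outer loop terminates; the paper's contribution is that, for a fixed discount factor, the number of iterations is strongly polynomial in the instance size.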