Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in the transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form an expressive model that subsumes both classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. Whether such optimization models admit polynomial-time and strongly polynomial-time algorithms is a central question. For MDPs, linear programming yields polynomial-time algorithms for an arbitrary discount factor, and the seminal work of Ye established strongly polynomial time for a fixed discount factor. Generalizing these results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a fixed discount factor, resolving this algorithmic question.
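The abstract does not spell out the algorithm, so the following is only an illustrative sketch of what a robust policy iteration loop can look like, not the authors' method or its analysis. It assumes the $L_\infty$ uncertainty set for each pair $(s, a)$ is the set of distributions within entrywise distance `delta` of a nominal distribution `p_hat[s][a]`, that the controller maximizes discounted reward while the adversary minimizes it, and that policies are evaluated by simple fixed-point iteration. All function and parameter names are hypothetical.

```python
def worst_case_distribution(p_hat, delta, v):
    """Minimise sum(p[i] * v[i]) over distributions p with |p[i] - p_hat[i]| <= delta.

    Greedy argument: start every entry at its lower bound, then hand the
    remaining probability mass to successor states in order of increasing value.
    """
    lo = [max(0.0, q - delta) for q in p_hat]
    hi = [min(1.0, q + delta) for q in p_hat]
    p = lo[:]
    remaining = max(0.0, 1.0 - sum(lo))
    for i in sorted(range(len(v)), key=lambda j: v[j]):
        add = min(hi[i] - lo[i], remaining)
        p[i] += add
        remaining -= add
    return p


def robust_q(r, p_hat, delta, gamma, v, s, a):
    """Worst-case one-step lookahead value of action a in state s."""
    p = worst_case_distribution(p_hat[s][a], delta, v)
    return r[s][a] + gamma * sum(pi * vi for pi, vi in zip(p, v))


def evaluate_policy(policy, r, p_hat, delta, gamma, tol=1e-10):
    """Approximate the robust value of a fixed policy by fixed-point iteration."""
    v = [0.0] * len(r)
    while True:
        new_v = [robust_q(r, p_hat, delta, gamma, v, s, policy[s])
                 for s in range(len(r))]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def robust_policy_iteration(r, p_hat, delta, gamma, max_iters=1000):
    """Alternate robust evaluation and greedy improvement until the policy is stable."""
    n = len(r)
    policy = [0] * n
    v = [0.0] * n
    for _ in range(max_iters):
        v = evaluate_policy(policy, r, p_hat, delta, gamma)
        changed = False
        for s in range(n):
            best_a = policy[s]
            best_q = robust_q(r, p_hat, delta, gamma, v, s, best_a)
            for a in range(len(r[s])):
                q = robust_q(r, p_hat, delta, gamma, v, s, a)
                if q > best_q + 1e-9:  # switch only on a strict improvement
                    best_a, best_q = a, q
            if best_a != policy[s]:
                policy[s] = best_a
                changed = True
        if not changed:
            break
    return policy, v


# Toy instance (made up for illustration): state 1 pays reward 1 and loops,
# state 0 pays nothing and can either stay (action 0) or move to state 1 (action 1).
r = [[0.0, 0.0], [1.0, 1.0]]
p_hat = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
nominal_policy, nominal_v = robust_policy_iteration(r, p_hat, delta=0.0, gamma=0.5)
robust_policy, robust_v = robust_policy_iteration(r, p_hat, delta=0.1, gamma=0.5)
```

With `delta=0.0` the uncertainty set collapses to the nominal distribution, so the sketch reduces to ordinary policy iteration on a classical MDP, in line with the claim that the model subsumes MDPs. The strongly polynomial bound stated in the abstract concerns the authors' algorithm and is not demonstrated by this sketch.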