Robust Markov decision processes (RMDPs) extend standard Markov decision processes (MDPs) to account for uncertainty in the transition probabilities. An RMDP comes with an uncertainty set of possible transition functions, each of which induces a standard MDP. The natural objective in an RMDP is to optimize the discounted cumulative reward under the worst-case transition function in the uncertainty set. We study the complexity of the associated threshold problem for RMDPs with polytopic uncertainty sets in halfspace representation. Previous results either focused on approximating the optimum or were restricted to specific subclasses of RMDPs, such as interval MDPs or $L_\infty$-RMDPs. Our contributions are threefold: (1) For $(s,a)$-rectangular RMDPs, we prove that robust policy evaluation is in P via robust linear programming, and that the threshold problem is in NP. As a corollary, robust policy iteration is a polynomial-time algorithm for these RMDPs when the discount factor is fixed. (2) For $s$-rectangular RMDPs, we show that the threshold problem is in PSPACE via the first-order theory of the reals. (3) We establish lower bounds by reducing both parity games and bisimulation metrics between MDP states to the RMDP threshold problem. Consequently, a polynomial-time algorithm for the threshold problem would affirmatively resolve the long-standing open question of whether parity games can be solved in polynomial time. The reduction from bisimulation metrics also yields a practical benefit: it allows us to apply robust policy iteration as a more efficient alternative to the standard fixed-point iteration, as our empirical evaluation demonstrates.
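To make the worst-case objective concrete, the following is a minimal sketch of robust value iteration for an interval MDP, one of the subclasses mentioned above. It is not the robust linear programming or robust policy iteration procedure of this work: the function names, the toy model, and the greedy inner minimization are assumptions chosen for illustration, and the final threshold check assumes the threshold problem asks whether the optimal robust value of an initial state is at least a given bound.

```python
def worst_case_expectation(lower, upper, values):
    """Minimise sum_i p[i] * values[i] subject to lower <= p <= upper, sum(p) == 1.

    Assumes the interval constraints are feasible (sum(lower) <= 1 <= sum(upper)).
    """
    p = list(lower)
    budget = 1.0 - sum(lower)
    # Greedy: push the free probability mass onto the lowest-value successors first.
    for i in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(upper[i] - lower[i], budget)
        p[i] += extra
        budget -= extra
    return sum(pi * vi for pi, vi in zip(p, values))


def robust_value_iteration(rewards, lower, upper, gamma, tol=1e-10, max_iter=100_000):
    """Iterate v(s) = max_a [ r(s,a) + gamma * min_p p . v ] until the update is below tol.

    rewards[s][a] is the reward of action a in state s; lower[s][a] and upper[s][a]
    are per-successor probability bounds, i.e. an (s,a)-rectangular interval set.
    """
    n = len(rewards)
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = [
            max(
                rewards[s][a]
                + gamma * worst_case_expectation(lower[s][a], upper[s][a], v)
                for a in range(len(rewards[s]))
            )
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical two-state model: state 1 is absorbing with reward 1.
# In state 0, action 0 moves to either state with probability in [0.2, 0.8];
# action 1 stays in state 0 for a small reward.
rewards = [[0.0, 0.1], [1.0]]
lower = [[[0.2, 0.2], [1.0, 0.0]], [[0.0, 1.0]]]
upper = [[[0.8, 0.8], [1.0, 0.0]], [[0.0, 1.0]]]

values = robust_value_iteration(rewards, lower, upper, gamma=0.5)
threshold = 0.3
print(values, values[0] >= threshold)
```

In this toy model the adversary puts as much mass as allowed on the low-value state 0, so the robust value of state 0 solves $v_0 = 0.5\,(0.8\,v_0 + 0.2 \cdot 2)$, giving $v_0 = 1/3$, while $v_1 = 2$. For general polytopic sets in halfspace representation the greedy step no longer applies, and the inner minimization would instead be a linear program.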