Markov decision processes (MDPs) are a well-established model for sequential decision-making in the presence of probabilities. In robust MDPs (RMDPs), every action is associated with an uncertainty set of probability distributions, modelling that transition probabilities are not known precisely. Based on the known theoretical connection to stochastic games, we provide a framework for solving RMDPs that is generic, reliable, and efficient. It is *generic* both with respect to the model, allowing for a wide range of uncertainty sets, including but not limited to intervals, $L^1$- or $L^2$-balls, and polytopes; and with respect to the objective, including long-run average reward, undiscounted total reward, and stochastic shortest path. It is *reliable*, as our approach not only converges in the limit, but provides precision guarantees at any time during the computation. It is *efficient* because -- in contrast to state-of-the-art approaches -- it avoids explicitly constructing the underlying stochastic game. Consequently, our prototype implementation outperforms existing tools by several orders of magnitude and can solve RMDPs with a million states in under a minute.