Policy Iteration (PI) is a widely used family of algorithms for computing optimal policies of Markov Decision Problems (MDPs). We derive upper bounds on the running time of PI on Deterministic MDPs (DMDPs), the class of MDPs in which every state-action pair has a unique next state. Our results include a non-trivial upper bound that applies to the entire family of PI algorithms, a second bound that applies to all "max-gain" switching variants, and confirmation that a conjecture regarding Howard's PI on MDPs holds for DMDPs. Our analysis rests on certain graph-theoretic results, which may be of independent interest.
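To make the setting concrete, the following is a minimal sketch of a Howard-style PI loop on a small DMDP. It is an illustration under stated assumptions, not the construction analysed in this work: the discounted-reward criterion, the discount factor `gamma`, the approximate sweep-based policy evaluation, the tolerance `tol`, the function names `evaluate` and `howard_pi`, and the three-state example are all hypothetical choices made here. A DMDP is encoded by a table `next_state[s][a]` giving the unique successor of each state-action pair, and each iteration switches every improvable state to an action of maximal gain.

```python
from typing import List, Tuple


def evaluate(next_state: List[List[int]], reward: List[List[float]],
             policy: List[int], gamma: float, sweeps: int = 2000) -> List[float]:
    """Approximate the discounted value of a fixed policy by repeated backups.

    Assumption: a fixed number of sweeps is used instead of an exact solve;
    the error shrinks like gamma**sweeps.
    """
    n = len(policy)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [reward[s][policy[s]] + gamma * v[next_state[s][policy[s]]]
             for s in range(n)]
    return v


def howard_pi(next_state: List[List[int]], reward: List[List[float]],
              gamma: float = 0.9, tol: float = 1e-9
              ) -> Tuple[List[int], List[float], int]:
    """Howard-style PI: switch every improvable state in each iteration."""
    n = len(next_state)
    policy = [0] * n          # arbitrary initial policy (assumption)
    iterations = 0
    while True:
        v = evaluate(next_state, reward, policy, gamma)
        new_policy = list(policy)
        for s in range(n):
            # One-step lookahead value of each action under the current values.
            q = [reward[s][a] + gamma * v[next_state[s][a]]
                 for a in range(len(next_state[s]))]
            best = max(range(len(q)), key=q.__getitem__)
            # Switch only on a strict improvement beyond the tolerance.
            if q[best] > v[s] + tol:
                new_policy[s] = best
        if new_policy == policy:
            return policy, v, iterations
        policy = new_policy
        iterations += 1


# Hypothetical 3-state, 2-action DMDP: state 2 has a rewarding self-loop.
NEXT_STATE = [[0, 1], [1, 2], [2, 0]]
REWARD = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]

policy, values, iterations = howard_pi(NEXT_STATE, REWARD)
print(policy, [round(x, 3) for x in values], iterations)
```

The returned iteration count is the quantity that running-time bounds of this kind concern: the number of policy switches performed before no state remains improvable.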