Algorithmic analysis of Markov decision processes (MDP) and stochastic games (SG) in practice relies on value-iteration (VI) algorithms. Since the basic version of VI does not provide guarantees on the precision of the result, variants of VI have been proposed that offer such guarantees. In particular, sound value iteration (SVI) not only provides precise lower and upper bounds on the result, but also converges faster in the presence of probabilistic cycles. Unfortunately, it is applicable neither to SG nor to MDP with end components. In this paper, we extend SVI to cover both cases. The technical challenge lies mainly in the proper treatment of end components, which require different handling than in the literature. Moreover, we provide several optimizations of SVI. Finally, we evaluate our prototype implementation experimentally to confirm its advantages on systems with probabilistic cycles.
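To make the motivation concrete, the following is a minimal sketch (not the paper's SVI algorithm) of classic value iteration for maximal reachability probabilities on a hypothetical toy MDP. The stopping criterion based on the difference between consecutive iterates is exactly the "unsound" part the abstract refers to: in the presence of probabilistic cycles, a small update difference does not bound the distance to the true value, and iterating from above would get stuck on the end component formed by states 0 and 1.

```python
# Toy MDP: states 0..3, state 3 is the target, state 2 is a losing sink.
# transitions[s][a] = list of (probability, successor) pairs.
transitions = {
    0: {"a": [(0.5, 0), (0.5, 1)]},           # probabilistic self-loop
    1: {"a": [(0.7, 3), (0.3, 2)],            # risky move towards the target
        "b": [(1.0, 0)]},                     # 0 and 1 form an end component
    2: {"a": [(1.0, 2)]},                     # sink (absorbing)
    3: {"a": [(1.0, 3)]},                     # target (absorbing)
}
TARGET = {3}

def value_iteration(eps=1e-6):
    """Iterate the Bellman operator from below until successive
    iterates differ by less than eps (no soundness guarantee)."""
    V = {s: (1.0 if s in TARGET else 0.0) for s in transitions}
    while True:
        new = {}
        for s in transitions:
            if s in TARGET:
                new[s] = 1.0
            else:
                # Maximizing player picks the best action.
                new[s] = max(
                    sum(p * V[t] for p, t in succ)
                    for succ in transitions[s].values()
                )
        if max(abs(new[s] - V[s]) for s in V) < eps:
            return new
        V = new

print(value_iteration()[0])  # approaches the true value 0.7 from below
```

Note that assigning value 1 to both states 0 and 1 is also a fixpoint of the Bellman operator, which is why iteration from above (as needed for upper bounds) requires the special treatment of end components discussed in the paper.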