Algorithmic analysis of Markov decision processes (MDPs) and stochastic games (SGs) in practice relies on value-iteration (VI) algorithms. Since the basic version of VI provides no guarantees on the precision of its result, variants of VI that do offer such guarantees have been proposed. In particular, sound value iteration (SVI) not only provides precise lower and upper bounds on the result, but also converges faster in the presence of probabilistic cycles. Unfortunately, it is applicable neither to SGs nor to MDPs with end components. In this paper, we extend SVI to cover both cases. The main technical challenge lies in the proper treatment of end components, which require different handling than in the literature. Moreover, we provide several optimizations of SVI. Finally, we evaluate our prototype implementation experimentally to confirm its advantages on systems with probabilistic cycles.
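To illustrate the kind of guarantee at stake, the following is a minimal sketch of value iteration with lower and upper bounds (in the style of interval iteration, not the paper's SVI) for maximal reachability in a tiny hypothetical MDP; the state space, transition probabilities, and precision threshold are invented for illustration.

```python
# Illustrative sketch: value iteration maintaining a lower and an upper
# bound on the maximal probability of reaching a goal state in an MDP.
# The MDP below is a hypothetical example, not taken from the paper.

# States: 0 = s0, 1 = goal (absorbing), 2 = sink (absorbing).
# s0 has two actions:
#   a: prob 0.5 -> goal, prob 0.5 -> s0   (a probabilistic cycle)
#   b: prob 1.0 -> sink
ACTIONS = {
    0: [[(0.5, 1), (0.5, 0)], [(1.0, 2)]],
    1: [[(1.0, 1)]],
    2: [[(1.0, 2)]],
}
GOAL, SINK, EPS = 1, 2, 1e-6

def bellman(v):
    # One Bellman update: maximize the expected value over actions.
    return [max(sum(p * v[t] for p, t in act) for act in ACTIONS[s])
            for s in ACTIONS]

lo = [1.0 if s == GOAL else 0.0 for s in ACTIONS]  # under-approximation
hi = [0.0 if s == SINK else 1.0 for s in ACTIONS]  # over-approximation
while max(h - l for l, h in zip(lo, hi)) > EPS:
    lo, hi = bellman(lo), bellman(hi)

# The true value at s0 is 1.0; on termination the bounds enclose it
# with a gap of at most EPS.
print(lo[0], hi[0])
```

Plain VI would compute only the `lo` sequence and stop on a small change between iterations, which gives no bound on the distance to the true value. Note also that if `s0` had an action looping back to itself with probability 1, the set `{s0}` would form an end component and the upper bound would stay at 1 forever; this non-convergence of naive over-approximations is precisely the end-component issue the paper addresses.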