Modified policy iteration (MPI) is a dynamic programming algorithm that combines elements of policy iteration and value iteration. The convergence of MPI has been well studied for discounted and average-cost Markov decision processes (MDPs). In this work, we consider the exponential-cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied for risk-sensitive MDPs, MPI has not been explored in this setting. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Because the exponential-cost formulation leads to a multiplicative Bellman equation, our main contribution is a convergence proof that differs substantially from existing results for discounted and risk-neutral average-cost problems, as well as from existing analyses of risk-sensitive value iteration and policy iteration. We conclude with simulation results that assess MPI's performance relative to value iteration and policy iteration across a range of problem parameters. In these simulations, risk-sensitive MPI is computationally more efficient than both value iteration and policy iteration.
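To make the idea concrete, the following is a minimal sketch of how MPI might look for the exponential-cost problem; it is not the algorithm or code from the paper. It assumes the standard form of the multiplicative Bellman equation, λ h(i) = min_a exp(α c(i,a)) Σ_j P(j|i,a) h(j), and it keeps the iterates bounded by dividing by the value at a reference state, which is an assumption made here for illustration. The function name `risk_sensitive_mpi`, its parameters, and the array layout are all hypothetical. Each outer iteration performs one greedy policy improvement followed by `m` applications of the fixed-policy operator, so a small `m` behaves like value iteration and a large `m` approaches policy iteration.

```python
import numpy as np


def risk_sensitive_mpi(P, c, alpha=1.0, m=5, iters=200, ref=0):
    """Illustrative risk-sensitive MPI sketch (hypothetical helper).

    P     : array (A, S, S), P[a, i, j] = transition probability i -> j under a
    c     : array (S, A), per-stage cost of action a in state i
    alpha : risk-sensitivity parameter (> 0)
    m     : number of partial-evaluation sweeps per outer iteration
    ref   : reference state used for normalization (an assumption of this sketch)
    """
    A, S, _ = P.shape
    # Multiplicative kernel: Q[a, i, j] = exp(alpha * c[i, a]) * P[a, i, j]
    Q = np.exp(alpha * c.T)[:, :, None] * P
    h = np.ones(S)
    states = np.arange(S)
    policy = np.zeros(S, dtype=int)

    for _ in range(iters):
        # Policy improvement: greedy with respect to the current h
        policy = (Q @ h).argmin(axis=0)      # (A, S) -> (S,)
        Q_mu = Q[policy, states, :]          # kernel rows of the greedy policy

        # Partial policy evaluation: m normalized applications of T_mu
        for _ in range(m):
            h = Q_mu @ h
            h = h / h[ref]

    # Estimate the multiplicative eigenvalue and the risk-sensitive average cost
    lam = (Q @ h).min(axis=0)[ref] / h[ref]
    return policy, h, np.log(lam) / alpha


if __name__ == "__main__":
    # Small made-up example: 2 states, 2 actions
    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.7, 0.3]]])
    c = np.array([[1.0, 2.0],
                  [0.5, 3.0]])
    policy, h, rho = risk_sensitive_mpi(P, c, alpha=0.5, m=3)
    print(policy, h, rho)
```

The sketch itself does not guarantee convergence; that depends on structural conditions on the MDP of the kind the paper's analysis addresses. A simple sanity check is that when every action has the same constant cost, the returned cost estimate should equal that constant.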