Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge because dynamic programming fails for such objectives. In this paper, we propose a new approach for finding the globally optimal policy for combined metrics of steady-state mean and variance in an infinite-horizon undiscounted MDP. By introducing the concepts of pseudo mean and pseudo variance, we convert the original problem into a bilevel MDP problem, in which the inner problem is a standard MDP optimizing the pseudo mean-variance and the outer problem is a single-parameter selection problem optimizing the pseudo mean. We use the sensitivity analysis of MDPs to derive the properties of this bilevel problem. By solving the inner standard MDPs for pseudo mean-variance optimization, we can identify worse policy spaces that are dominated by the optimal policies of the pseudo problems. We propose an optimization algorithm that finds the globally optimal policy by repeatedly removing worse policy spaces, and we study its convergence and complexity. We also propose a further policy dominance property that improves the efficiency of the algorithm. Numerical experiments demonstrate the performance and efficiency of our algorithms. To the best of our knowledge, our algorithm is the first that efficiently finds the globally optimal policy for mean-variance optimization in MDPs. The results also hold when only the variance metric is minimized.
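The bilevel structure can be illustrated on a toy problem. The sketch below is not the algorithm proposed in this paper; it only checks numerically why the reformulation is possible, under assumptions that the abstract does not spell out. We assume the combined metric is mean minus `BETA` times variance, and that the pseudo variance is the variance computed around a fixed parameter `y` (the pseudo mean) instead of the true mean. The two-state MDP, its rewards and transition probabilities, and all function names are hypothetical, and the inner problem is solved by brute-force enumeration as a stand-in for a standard MDP solver.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# For each (state, action): (reward, probability of switching to the other state).
MDP = {
    0: {0: (1.0, 0.5), 1: (2.0, 0.9)},
    1: {0: (4.0, 0.5), 1: (3.0, 0.2)},
}
BETA = 0.5  # assumed weight on variance in the metric: mean - BETA * variance

# All deterministic stationary policies, as (action in state 0, action in state 1).
POLICIES = list(product((0, 1), repeat=2))


def stationary(policy):
    """Steady-state distribution of the 2-state chain induced by a policy."""
    a = MDP[0][policy[0]][1]  # P(0 -> 1)
    b = MDP[1][policy[1]][1]  # P(1 -> 0)
    return (b / (a + b), a / (a + b))


def rewards(policy):
    return [MDP[s][policy[s]][0] for s in (0, 1)]


def mean_var(policy):
    """Steady-state mean and variance of the reward under a policy."""
    pi, r = stationary(policy), rewards(policy)
    mean = sum(p * x for p, x in zip(pi, r))
    var = sum(p * (x - mean) ** 2 for p, x in zip(pi, r))
    return mean, var


def true_objective(policy):
    mean, var = mean_var(policy)
    return mean - BETA * var


def pseudo_mean_variance(policy, y):
    """Long-run average of the modified reward r - BETA * (r - y)^2.

    For fixed y this is an ordinary average-reward objective, so the
    inner problem is a standard MDP.
    """
    pi, r = stationary(policy), rewards(policy)
    return sum(p * (x - BETA * (x - y) ** 2) for p, x in zip(pi, r))


def inner_optimum(y):
    """Inner problem for fixed y, solved here by enumeration."""
    return max(POLICIES, key=lambda d: pseudo_mean_variance(d, y))


# Direct (brute-force) optimum of the true mean-variance metric.
direct_best = max(true_objective(d) for d in POLICIES)

# Outer problem: in this toy case it suffices to try y at each policy's mean.
candidate_ys = [mean_var(d)[0] for d in POLICIES]
bilevel_best = max(pseudo_mean_variance(inner_optimum(y), y) for y in candidate_ys)

print(direct_best, bilevel_best)
```

Under these assumptions, the pseudo mean-variance of a policy equals its true metric minus `BETA * (mean - y)**2`, so it never exceeds the true metric and matches it when `y` equals the policy's own mean. This is why the two printed values agree up to rounding. The enumeration over policies is only feasible for toy problems; the paper's contribution is an efficient way to prune the policy space instead.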