The Markov Decision Process (MDP) is a common mathematical model for sequential decision-making problems. In this paper, we present a new geometric interpretation of MDPs that is useful for analyzing the dynamics of the main MDP algorithms. Based on this interpretation, we demonstrate that MDPs can be split into equivalence classes within which algorithm dynamics are indistinguishable. The related normalization procedure allows for the design of a new class of MDP-solving algorithms that find optimal policies without computing policy values.