Canonical models of Markov decision processes (MDPs) usually assume geometric discounting with a constant discount factor. While this standard approach has led to many elegant results, recent studies indicate that some applications call for time-varying discounting. This paper studies a model of infinite-horizon MDPs with time-varying discount factors. We take a game-theoretic perspective, in which each time step is treated as an independent decision maker with its own (fixed) discount factor, and we study the subgame perfect equilibrium (SPE) of the resulting game together with the related algorithmic problems. We give a constructive proof that an SPE exists and show that computing an SPE is EXPTIME-hard. We then turn to the approximate notion of $\epsilon$-SPE and show that an $\epsilon$-SPE exists under milder assumptions. Finally, we present an algorithm that computes an $\epsilon$-SPE and provide an upper bound on its time complexity as a function of the convergence property of the time-varying discount factor.
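To make the game-theoretic view concrete, the following is a minimal sketch of backward induction over time-step players. It is not the algorithm of the paper. It rests on two assumptions that the abstract does not state: the horizon is truncated to a finite length, and the player at step t values a reward received k steps later with weight gamma_t ** k. The names `backward_induction`, `P`, `R`, and `gammas` are hypothetical.

```python
def backward_induction(P, R, gammas):
    """Finite-horizon backward induction with one player per time step.

    P[s][a] is a list of next-state probabilities, R[s][a] is the immediate
    reward, and gammas[t] is the discount factor of the player at step t.
    Assumption (not from the source): the horizon is truncated to len(gammas),
    and player t discounts a reward k steps ahead by gammas[t] ** k.
    Returns policy[t][s], the action chosen at step t in state s.
    """
    n_states = len(R)
    n_actions = len(R[0])
    horizon = len(gammas)
    policy = [[0] * n_states for _ in range(horizon)]

    def q_value(s, a, gamma, cont):
        # Immediate reward plus discounted expected continuation value.
        return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], cont))

    for t in reversed(range(horizon)):
        gamma = gammas[t]
        # Evaluate the already-fixed later policies under player t's OWN
        # discount factor; later players may have optimized for other factors.
        cont = [0.0] * n_states
        for k in reversed(range(t + 1, horizon)):
            cont = [q_value(s, policy[k][s], gamma, cont)
                    for s in range(n_states)]
        # Player t best-responds to the continuation play.
        for s in range(n_states):
            policy[t][s] = max(range(n_actions),
                               key=lambda a: q_value(s, a, gamma, cont))
    return policy


# Toy instance: in state 0, action 0 pays 1 and stays, while action 1 pays 0
# now and moves to state 1, which pays 3 and returns to state 0.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [3.0, 3.0]]
gammas = [0.9, 0.1, 0.5]  # patient player at step 0, impatient at step 1
policy = backward_induction(P, R, gammas)
```

In this toy instance the patient player at step 0 takes the delayed-reward action in state 0, whereas the impatient player at step 1 takes the immediate reward. Each player evaluates the same continuation play with a different discount factor, which is why a single Bellman recursion with one shared value function does not apply directly.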