Markov chains are the de facto finite-state model for stochastic dynamical systems, and Markov decision processes (MDPs) extend Markov chains with non-deterministic behavior. Given an MDP and rewards on states, a classical optimization criterion is the maximal expected total reward when the MDP stops after T steps, which can be computed by a simple dynamic programming algorithm. We consider a natural generalization in which the stopping time is chosen according to a probability distribution whose expectation is T, and the goal is to optimize the expected total reward. Quite surprisingly, we establish an inter-reducibility between the expected stopping-time problem for Markov chains and the Positivity problem (which is closely related to the well-known Skolem problem), for which establishing either decidability or undecidability would be a major breakthrough. Given the hardness of the exact problem, we consider its approximate version: we show that it can be solved in exponential time for Markov chains and in exponential space for MDPs.
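The "simple dynamic programming algorithm" for the fixed-horizon criterion is standard backward induction over the remaining number of steps. A minimal sketch in Python/NumPy, assuming a hypothetical encoding with states 0..S-1, a per-state reward r collected at each of the T visited states, and a transition tensor P[a, s, s'] (all names and the reward convention are illustrative, not the paper's notation):

```python
import numpy as np

def finite_horizon_value(P, r, T):
    """Maximal expected total reward when the MDP stops after T steps.

    P: array of shape (A, S, S); P[a, s, s'] is the probability of moving
       from state s to s' under action a (hypothetical encoding).
    r: array of shape (S,); reward collected in each visited state.
    T: horizon, i.e. the number of steps before stopping.
    Returns V[s], the optimal value from each starting state s.
    """
    A, S, _ = P.shape
    V = np.zeros(S)                 # value with 0 steps remaining
    for _ in range(T):
        # Backward induction: one step of reward plus the best
        # continuation value over all actions.
        Q = r[None, :] + P @ V      # Q[a, s] = r(s) + sum_s' P[a,s,s'] V[s']
        V = Q.max(axis=0)           # optimal action choice per state
    return V

# Example: two states, a "stay" action and a "swap" action.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay put
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: swap states
r = np.array([0.0, 1.0])                   # only state 1 pays reward
print(finite_horizon_value(P, r, T=3))     # -> [2. 3.]
```

Each of the T iterations costs one max over actions of a matrix-vector product, so the whole computation runs in time polynomial in the MDP size and linear in T.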
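For a Markov chain the generalized objective can be made concrete as follows (a sketch of one natural formalization, assuming the stopping horizon is drawn from a distribution independently of the run; the paper's exact definition may differ):

```latex
% Val(T): best expected total reward over stopping-time distributions
% delta on horizons whose mean is T (sketch, not the paper's notation).
\[
  \mathrm{Val}(T)
  \;=\;
  \sup_{\substack{\delta \in \Delta(\mathbb{N}) \\ \mathbb{E}_{t \sim \delta}[t] \,=\, T}}
  \;\sum_{t \ge 0} \delta(t)\, R_t,
  \qquad
  R_t \;=\; \mathbb{E}\!\left[\sum_{i=0}^{t-1} r(X_i)\right],
\]
```

One intuition for the link to Positivity: if M is the chain's transition matrix, the step-t expected reward \(\mathbb{E}[r(X_t)] = e_{s_0}^{\top} M^{t} r\) is a linear recurrence sequence in t, and the Positivity problem asks precisely whether such a sequence stays non-negative.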