论适用于马尔可夫决策过程的自动机 (On Good-for-MDPs Automata)

Nondeterministic good-for-MDPs (GFM) automata are for MDP model checking and reinforcement learning what good-for-games (GFG) automata are for reactive synthesis: a more compact alternative to deterministic automata that displays nondeterminism, but only so much that it can be resolved locally, such that a syntactic product can be analysed. GFM has recently been introduced as a property for reinforcement learning, where the simpler Büchi acceptance conditions it allows to use is key. However, while there are classic and novel techniques to obtain automata that are GFM, there has not been a decision procedure for checking whether or not an automaton is GFM. We show that GFM-ness is decidable and provide an EXPTIME decision procedure as well as a PSPACE-hardness proof. We also compare the succinctness of GFM automata with other types of automata with restricted nondeterminism. The first natural comparison point are GFG automata. Deterministic automata are GFG, and GFG automata are GFM, but not vice versa. This raises the question of how these classes relate in terms of succinctness. GFG automata are known to be exponentially more succinct than deterministic automata, but the gap between GFM and GFG automata as well as the gap between ordinary nondeterministic automata and those that are GFM have been open. We establish that these gaps are exponential, and sharpen this result by showing that the latter gap remains exponential when restricting the nondeterministic automata to separating safety or unambiguous reachability automata.

翻译：非确定性的适用于马尔可夫决策过程的自动机对于MDP模型检验和强化学习而言，正如适用于博弈的自动机对于反应式综合一样：它是一种比确定性自动机更紧凑的替代方案，虽然表现出非确定性，但其非确定性程度仅限于可在局部消解，从而能够对语法积进行分析。GFM最近被引入作为强化学习的一种性质，其允许使用的更简单的Büchi接受条件至关重要。然而，尽管存在经典和新颖的技术来获得GFM自动机，但一直缺乏用于检验一个自动机是否具有GFM性质的判定过程。我们证明了GFM性质是可判定的，并给出了一个EXPTIME判定过程以及一个PSPACE难度的证明。我们还比较了GFM自动机与其他类型具有受限非确定性的自动机的简洁性。第一个自然的比较点是GFG自动机。确定性自动机是GFG的，且GFG自动机是GFM的，但反之则不成立。这引出了这些类别在简洁性方面如何关联的问题。已知GFG自动机比确定性自动机指数级更简洁，但GFM与GFG自动机之间的差距，以及普通非确定性自动机与GFM自动机之间的差距，此前尚未明确。我们证明这些差距是指数级的，并通过证明当将非确定性自动机限制为分离的安全性自动机或无歧义可达性自动机时，后一差距仍保持指数级，从而强化了这一结果。

相关内容

马尔可夫决策过程

关注 23

马尔可夫决策过程（MDP）提供了一个数学框架，用于在结果部分随机且部分受决策者控制的情况下对决策建模。 MDP可用于研究通过动态编程和强化学习解决的各种优化问题。 MDP至少早在1950年代就已为人所知（参见）。马尔可夫决策过程的研究核心是罗纳德·霍华德（Ronald A. Howard）于1960年出版的《动态编程和马尔可夫过程》一书。它们被广泛用于各种学科，包括机器人技术，自动控制，经济学和制造。更精确地，马尔可夫决策过程是离散的时间随机控制过程。在每个时间步骤中，流程都处于某种状态，决策者可以选择该状态下可用的任何操作。该过程在下一时间步响应，随机进入新状态，并给予决策者相应的奖励。流程进入新状态的可能性受所选动作的影响。具体而言，它由状态转换函数给出。因此，下一个状态取决于当前状态和决策者的动作。但是给定和，它有条件地独立于所有先前的状态和动作；换句话说，MDP进程的状态转换满足Markov属性。马尔可夫决策过程是马尔可夫链的扩展。区别在于增加了动作（允许选择）和奖励（给予动机）。相反，如果每个状态仅存在一个动作（例如“等待”）并且所有奖励都相同（例如“零”），则马尔可夫决策过程将简化为马尔可夫链。

《自适应鲁棒马尔可夫决策过程：协同作战飞机（CCA）对抗性监视任务应用》44页技术报告

专知会员服务

25+阅读 · 2025年12月9日

【博士论文】《用于可验证数学自动化的语言模型：交互、集成与自动形式化》

专知会员服务

19+阅读 · 2025年3月14日

【CMU博士论文】可验证数学自动化的语言模型交互、集成与自动形式化

专知会员服务

21+阅读 · 2025年2月14日

加强军事决策过程（MDMP）：审视快速决策与同步（RDSP）、过程自动化和人工智能集成

专知会员服务

34+阅读 · 2024年6月15日