In this paper, we study planning in stochastic systems, modeled as Markov decision processes (MDPs), with preferences over temporally extended goals. Prior work on temporal planning with preferences assumes that the user's preferences form a total order, meaning that every pair of outcomes is comparable. In this work, we consider the case where the preferences over possible outcomes form a partial order rather than a total order, so some pairs of outcomes may be incomparable. We first introduce a variant of the deterministic finite automaton, referred to as a preference DFA, for specifying the user's preferences over temporally extended goals. Based on order theory, we translate the preference DFA into a preference relation over policies for probabilistic planning in a labeled MDP. In this treatment, a most preferred policy induces a weak-stochastic nondominated probability distribution over the finite paths in the MDP. The proposed planning algorithm hinges on the construction of a multi-objective MDP. We prove that a policy is weak-stochastic nondominated with respect to the preference specification if and only if it is Pareto-optimal in the constructed multi-objective MDP. Throughout the paper, we employ a running example to illustrate the proposed preference specification and solution approach. We demonstrate the efficacy of our algorithm on this example with a detailed analysis, and then discuss possible future directions.
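To make the notion of Pareto optimality in a multi-objective setting concrete, the following is a minimal sketch of filtering nondominated value vectors. It is an illustration only, not the planning algorithm of the paper: the policy names, the two objectives, and the numeric values are hypothetical, and we assume each objective is a quantity to be maximized (for example, the probability of achieving one class of outcomes).

```python
from typing import Dict, Tuple

Vector = Tuple[float, ...]


def dominates(u: Vector, v: Vector) -> bool:
    """Return True if u is at least as good as v in every objective
    and strictly better in at least one (all objectives are maximized)."""
    at_least_as_good = all(a >= b for a, b in zip(u, v))
    strictly_better = any(a > b for a, b in zip(u, v))
    return at_least_as_good and strictly_better


def pareto_front(values: Dict[str, Vector]) -> Dict[str, Vector]:
    """Keep only the entries whose value vector is not dominated by any other."""
    return {
        name: vec
        for name, vec in values.items()
        if not any(dominates(other, vec) for other in values.values())
    }


# Hypothetical value vectors for four candidate policies (assumed data).
policy_values = {
    "pi_1": (0.9, 0.1),
    "pi_2": (0.5, 0.5),
    "pi_3": (0.4, 0.4),  # dominated by pi_2 in both objectives
    "pi_4": (0.1, 0.9),
}

front = pareto_front(policy_values)
print(sorted(front))
```

In this toy instance, `pi_1`, `pi_2`, and `pi_4` are mutually incomparable, which mirrors the partial-order setting: no single policy is best, and the planner's output is a set of nondominated policies rather than one optimum.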