We consider lexicographic bi-objective problems on Markov Decision Processes (MDPs), where we optimize one objective while guaranteeing optimality of another. We propose a two-stage technique for solving such problems when the objectives are related (in a way that we formalize). We instantiate our technique for two natural pairs of objectives: minimizing the (conditional) expected number of steps to a target while guaranteeing the optimal probability of reaching it; and maximizing the (conditional) expected average reward while guaranteeing the optimal probability of staying safe (w.r.t. some safe set of states). For the first pair of objectives, which covers the classical frozen lake environment from reinforcement learning, we also report on experiments performed with a prototype implementation of our algorithm and compare its results with those obtained from state-of-the-art probabilistic model checkers solving optimal reachability.
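To make the first pair of objectives concrete, the following is a minimal sketch of how a two-stage computation could look on a toy MDP. It is not the algorithm of the paper: the MDP encoding, the state and action names (`s0`, `s3`, `a`, `b`, `c`, `wait`), the function names, and the choice of plain value iteration are all assumptions made for illustration. Stage 1 computes the maximal reachability probability and keeps only value-preserving actions; stage 2 conditions the remaining transitions on eventually reaching the target and minimizes the expected number of steps.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: state -> action -> list of (probability, successor).
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def max_reach_prob(mdp: MDP, goal: str, sweeps: int = 1000, tol: float = 1e-12) -> Dict[str, float]:
    """Stage 1: value iteration (from below) for the maximal probability of reaching `goal`."""
    v = {s: (1.0 if s == goal else 0.0) for s in mdp}
    for _ in range(sweeps):
        delta = 0.0
        for s, actions in mdp.items():
            if s == goal or not actions:
                continue  # goal and absorbing states keep their value
            new = max(sum(p * v[t] for p, t in succ) for succ in actions.values())
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            break
    return v


def prune_and_condition(mdp: MDP, v: Dict[str, float], goal: str, eps: float = 1e-9) -> MDP:
    """Keep value-preserving actions and reweight transitions by v(t) / v(s)."""
    cond: MDP = {}
    for s, actions in mdp.items():
        if s == goal or v[s] <= eps:
            continue  # states that cannot reach the goal are dropped
        cond[s] = {}
        for a, succ in actions.items():
            if abs(sum(p * v[t] for p, t in succ) - v[s]) <= eps:
                cond[s][a] = [(p * v[t] / v[s], t) for p, t in succ if v[t] > eps]
    return cond


def min_cond_steps(cond: MDP, goal: str, sweeps: int = 1000, tol: float = 1e-12):
    """Stage 2: value iteration (from 0) for the minimal expected steps in the conditioned MDP."""
    steps = {s: 0.0 for s in cond}
    steps[goal] = 0.0

    def q(succ: List[Tuple[float, str]]) -> float:
        return 1.0 + sum(p * steps[t] for p, t in succ)

    for _ in range(sweeps):
        delta = 0.0
        for s, actions in cond.items():
            new = min(q(succ) for succ in actions.values())
            delta = max(delta, abs(new - steps[s]))
            steps[s] = new
        if delta < tol:
            break
    policy = {s: min(acts, key=lambda a: q(acts[a])) for s, acts in cond.items()}
    return steps, policy


# Toy example: "a" and "b" both reach the goal with probability 0.5, "c" only with 0.4,
# and "wait" preserves the value but never makes progress.
toy: MDP = {
    "s0": {
        "a": [(0.5, "goal"), (0.5, "hole")],
        "b": [(1.0, "s3")],
        "c": [(0.4, "goal"), (0.6, "hole")],
        "wait": [(1.0, "s0")],
    },
    "s3": {"go": [(0.5, "goal"), (0.5, "hole")]},
    "goal": {},
    "hole": {},
}

v = max_reach_prob(toy, "goal")
cond = prune_and_condition(toy, v, "goal")
steps, policy = min_cond_steps(cond, "goal")
print(v["s0"], sorted(cond["s0"]), steps["s0"], policy["s0"])
```

In this toy instance, pruning alone is not enough: `wait` survives stage 1 because it preserves the reachability value, yet a strategy that waits forever never reaches the target. The step-minimization in stage 2 rules it out, since every extra step adds cost. The tolerance-based pruning assumes the stage 1 values have converged; an exact method (for example, linear programming) would be a safer choice on larger models.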