Continuous-time Markov decision processes (CTMDPs) are canonical models for sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is available only through sampling, model-free reinforcement learning (RL) is the method of choice for computing optimal decision sequences. RL, however, requires the learning objective to be encoded as scalar reward signals. Because performing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) into scalar rewards for discrete-time Markov decision processes (MDPs). Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments with learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach that enables a correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
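To make the expectation semantics concrete, the following is a minimal sketch, not the translation proposed in this work. It estimates by sampling the long-run fraction of time spent in good states under a fixed policy. The two-state model, the action names, the `GOOD` set, and the function `average_good_time` are all hypothetical, and the sketch marks CTMDP states as good directly instead of building a product with an omega-regular automaton.

```python
import random

# Hypothetical toy CTMDP (an assumption for illustration only):
# MODEL[state][action] = (exit_rate, {next_state: probability}).
MODEL = {
    0: {"stay": (1.0, {1: 1.0})},
    1: {"slow": (1.0, {0: 1.0}), "fast": (3.0, {0: 1.0})},
}
GOOD = {0}  # assumed set of "good" states


def average_good_time(policy, steps=20000, seed=0):
    """Estimate the long-run fraction of time spent in GOOD under a fixed policy."""
    rng = random.Random(seed)
    state, total_time, good_time = 0, 0.0, 0.0
    for _ in range(steps):
        rate, successors = MODEL[state][policy[state]]
        # Dwell time in a CTMDP state is exponentially distributed with the exit rate.
        dwell = rng.expovariate(rate)
        total_time += dwell
        if state in GOOD:
            good_time += dwell  # time spent in a good state acts as the reward here
        next_states, probs = zip(*successors.items())
        state = rng.choices(next_states, weights=probs)[0]
    return good_time / total_time


fast = average_good_time({0: "stay", 1: "fast"})
slow = average_good_time({0: "stay", 1: "slow"})
```

For this toy model the exact values follow from the mean dwell times: with `fast`, the fraction is 1 / (1 + 1/3) = 0.75, and with `slow` it is 1 / (1 + 1) = 0.5, so a learner under the expectation semantics would prefer `fast` in state 1.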