Linear temporal logic (LTL) and omega-regular objectives, a superset of LTL, have recently been used to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm requires only a polynomial number of samples in the relevant parameters, and we perform experiments that confirm our theory.