Linear temporal logic (LTL) and omega-regular objectives (a superset of LTL) have recently been used to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm requires only a number of samples polynomial in the relevant parameters, and we perform experiments that confirm our theory.