We propose an automata-theoretic approach to reinforcement learning (RL) under complex spatio-temporal constraints with time windows. The problem is formulated as a Markov decision process subject to a bounded temporal logic constraint. Unlike existing RL methods, which can eventually learn optimal policies that satisfy such constraints, our approach enforces a desired probability of constraint satisfaction throughout learning. This is achieved by translating the bounded temporal logic constraint into a total automaton and avoiding "unsafe" actions based on the available prior information about the transition probabilities, namely a lower and an upper bound on each transition probability. We provide theoretical guarantees on the resulting probability of constraint satisfaction. We also provide numerical results for a scenario in which a robot explores the environment to discover high-reward regions while fulfilling periodic pick-up and delivery tasks that are encoded as temporal logic constraints.
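The idea of avoiding "unsafe" actions using only lower and upper bounds on the transition probabilities can be illustrated with a small sketch. It is not the construction from the paper: it assumes the states are already product states (environment state paired with automaton state), that satisfaction means being in an accepting state at the end of a fixed horizon, and that an action counts as safe when its worst-case satisfaction probability stays above a threshold. The names `worst_case_expectation`, `safe_actions`, `model`, `accepting`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# One entry per possible successor: (successor state, lower bound, upper bound).
Interval = Tuple[str, float, float]
Model = Dict[str, Dict[str, List[Interval]]]


def worst_case_expectation(succ: List[Interval], value: Dict[str, float]) -> float:
    """Minimise sum(p_i * value_i) subject to lo_i <= p_i <= hi_i and sum(p) = 1."""
    probs = {s: lo for s, lo, _ in succ}
    slack = 1.0 - sum(probs.values())
    # Push the leftover probability mass toward the lowest-value successors first.
    for s, lo, hi in sorted(succ, key=lambda t: value[t[0]]):
        extra = min(hi - lo, slack)
        probs[s] += extra
        slack -= extra
    return sum(p * value[s] for s, p in probs.items())


def safe_actions(model: Model, accepting: Set[str], horizon: int, threshold: float):
    """Backward induction over a finite horizon.

    Returns (allowed, value), where allowed[t][s] lists the actions whose
    worst-case probability of ending in an accepting state is >= threshold,
    and value[s] is the best worst-case probability from s at step 0.
    Assumption: later steps follow the action that is safest in the worst case.
    """
    value = {s: 1.0 if s in accepting else 0.0 for s in model}
    allowed = []
    for _ in range(horizon):
        q = {
            s: {a: worst_case_expectation(succ, value) for a, succ in acts.items()}
            for s, acts in model.items()
        }
        allowed.append({s: [a for a, v in q[s].items() if v >= threshold] for s in model})
        value = {s: max(q[s].values()) for s in model}
    allowed.reverse()  # built from the last step backward, so flip to time order
    return allowed, value


# Toy model (made-up numbers): "safe" reaches the goal with probability in [0.8, 0.9].
model = {
    "s0": {
        "safe": [("goal", 0.8, 0.9), ("fail", 0.1, 0.2)],
        "risky": [("goal", 0.3, 0.6), ("fail", 0.4, 0.7)],
    },
    "goal": {"stay": [("goal", 1.0, 1.0)]},
    "fail": {"stay": [("fail", 1.0, 1.0)]},
}
allowed, value = safe_actions(model, accepting={"goal"}, horizon=1, threshold=0.7)
print(allowed[0]["s0"], value["s0"])
```

In this toy model only the action `safe` survives the pruning at `s0`, because the worst case for `risky` places as much probability as its bounds allow on `fail`. A learner restricted to `allowed[t][s]` can still explore for reward, while every choice it makes keeps the pessimistic satisfaction probability above the chosen threshold.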