Robotic systems operating in dynamic and uncertain environments increasingly require planners that carry out complex task sequences while adhering to strict temporal constraints. Metric Interval Temporal Logic (MITL) offers a formal and expressive framework for specifying such time-bounded requirements; however, integrating MITL with reinforcement learning (RL) remains challenging due to stochastic dynamics and partial observability. This paper presents a unified automata-based RL framework for synthesizing policies in both Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) under MITL specifications. MITL formulas are translated into Timed Limit-Deterministic Generalized Büchi Automata (Timed-LDGBA) and synchronized with the underlying decision process to construct product timed models suitable for Q-learning. A simple yet expressive reward structure enforces temporal correctness while allowing additional performance objectives to be incorporated. The approach is validated in three simulation studies: a $5 \times 5$ grid-world formulated as an MDP, a $10 \times 10$ grid-world formulated as a POMDP, and an office-like service-robot scenario. Results demonstrate that the proposed framework consistently learns policies that satisfy strict time-bounded requirements under stochastic transitions, scales to larger state spaces, and remains effective in partially observable environments, highlighting its potential for reliable robotic planning in time-critical and uncertain settings.
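To make the product-model idea concrete, the sketch below illustrates the kind of construction the abstract describes, not the paper's implementation: a toy $5 \times 5$ grid MDP is composed with a hand-coded timed monitor standing in for a Timed-LDGBA acceptance check, and tabular Q-learning is run over the joint (environment state, automaton state, clock) space with a simple accept/violate reward. The specification ("reach the goal within 10 steps"), the transition noise, the reward values, and all identifiers are illustrative assumptions.

```python
# Minimal sketch (assumed names and parameters, not the authors' code):
# Q-learning on the product of a stochastic grid MDP and a toy timed monitor.
import random
from collections import defaultdict

GRID, GOAL, DEADLINE = 5, (4, 4), 10          # 5x5 grid, time bound of 10 steps
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

def monitor_step(q, clock, cell):
    """Toy timed monitor: accept if GOAL is reached before DEADLINE expires."""
    if q in ("acc", "rej"):
        return q, clock
    clock += 1
    if cell == GOAL and clock <= DEADLINE:
        return "acc", clock
    if clock > DEADLINE:
        return "rej", clock
    return "run", clock

def env_step(cell, a):
    """Stochastic grid transition: intended move with prob. 0.8, stay otherwise."""
    if random.random() < 0.8:
        x = min(max(cell[0] + a[0], 0), GRID - 1)
        y = min(max(cell[1] + a[1], 0), GRID - 1)
        return (x, y)
    return cell

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.95, 0.2

for _ in range(5000):
    cell, q, clock = (0, 0), "run", 0
    while q == "run":
        s = (cell, q, clock)                   # product state: (env, automaton, clock)
        a = (random.choice(ACTIONS) if random.random() < eps
             else max(ACTIONS, key=lambda u: Q[(s, u)]))
        cell2 = env_step(cell, a)
        q2, clock2 = monitor_step(q, clock, cell2)
        # Reward only for acceptance or violation of the timed specification.
        r = 1.0 if q2 == "acc" else (-1.0 if q2 == "rej" else 0.0)
        s2 = (cell2, q2, clock2)
        best_next = 0.0 if q2 != "run" else max(Q[(s2, u)] for u in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        cell, q, clock = cell2, q2, clock2
```

In the actual framework the monitor would be replaced by a Timed-LDGBA derived automatically from an MITL formula, and the POMDP case would learn over belief or observation histories rather than the true grid cell; the sketch only shows the shape of the product-state Q-learning loop and the acceptance-driven reward.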