Reinforcement learning with timed constraints for robotics motion planning

Robotic systems operating in dynamic and uncertain environments increasingly require planners that satisfy complex task sequences while adhering to strict temporal constraints. Metric Interval Temporal Logic (MITL) offers a formal and expressive framework for specifying such time-bounded requirements; however, integrating MITL with reinforcement learning (RL) remains challenging due to stochastic dynamics and partial observability. This paper presents a unified automata-based RL framework for synthesizing policies in both Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) under MITL specifications. MITL formulas are translated into Timed Limit-Deterministic Generalized Büchi Automata (Timed-LDGBA) and synchronized with the underlying decision process to construct product timed models suitable for Q-learning. A simple yet expressive reward structure enforces temporal correctness while allowing additional performance objectives. The approach is validated in three simulation studies: a $5 \times 5$ grid-world formulated as an MDP, a $10 \times 10$ grid-world formulated as a POMDP, and an office-like service-robot scenario. Results demonstrate that the proposed framework consistently learns policies that satisfy strict time-bounded requirements under stochastic transitions, scales to larger state spaces, and remains effective in partially observable environments, highlighting its potential for reliable robotic planning in time-critical and uncertain settings.

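
To make the product construction concrete, the following is a minimal sketch of tabular Q-learning over a product of a grid-world MDP and a timed monitor. It is not the paper's Timed-LDGBA translation: the monitor is hand-written for the single bounded-reachability formula "reach the goal within `DEADLINE` steps", time is assumed to advance by one unit per action, and the goal cell, `DEADLINE`, `SLIP`, the reward values (+1 on acceptance, -1 on a missed deadline), and all function names are illustrative assumptions.

```python
import random

SIZE = 5                 # 5 x 5 grid, as in the MDP case study
GOAL = (4, 4)            # assumed goal cell
DEADLINE = 12            # assumed time bound T in "reach GOAL within T steps"
SLIP = 0.1               # assumed probability that a random action is executed
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def env_step(pos, action, rng):
    """Stochastic grid move; walls clamp the position to the grid."""
    if rng.random() < SLIP:
        action = rng.randrange(len(ACTIONS))
    dx, dy = ACTIONS[action]
    x = min(max(pos[0] + dx, 0), SIZE - 1)
    y = min(max(pos[1] + dy, 0), SIZE - 1)
    return (x, y)


def monitor_step(q, clock, pos):
    """Hand-written timed monitor with states 'wait', 'accept', 'reject'.

    The clock counts elapsed steps. Reaching GOAL no later than DEADLINE
    accepts; running out of time rejects. Both outcomes are absorbing.
    """
    if q != "wait":
        return q, clock
    clock += 1
    if pos == GOAL and clock <= DEADLINE:
        return "accept", clock
    if clock >= DEADLINE:
        return "reject", clock
    return "wait", clock


def train(episodes=5000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning on product states (position, monitor state, clock)."""
    rng = random.Random(seed)
    Q = {}

    def qvals(s):
        return Q.setdefault(s, [0.0] * len(ACTIONS))

    for _ in range(episodes):
        pos, q, clock = (0, 0), "wait", 0
        while q == "wait":
            s = (pos, q, clock)
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))      # explore
            else:
                vals = qvals(s)
                a = vals.index(max(vals))            # exploit
            pos2 = env_step(pos, a, rng)
            q2, clock2 = monitor_step(q, clock, pos2)
            # Assumed reward: +1 on acceptance, -1 on a missed deadline.
            r = 1.0 if q2 == "accept" else (-1.0 if q2 == "reject" else 0.0)
            # Terminal product states contribute no bootstrapped value.
            target = r if q2 != "wait" else r + gamma * max(qvals((pos2, q2, clock2)))
            qvals(s)[a] += alpha * (target - qvals(s)[a])
            pos, q, clock = pos2, q2, clock2
    return Q


def success_rate(Q, trials=200, seed=1):
    """Fraction of greedy rollouts in which the monitor accepts."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        pos, q, clock = (0, 0), "wait", 0
        while q == "wait":
            vals = Q.get((pos, q, clock), [0.0] * len(ACTIONS))
            pos = env_step(pos, vals.index(max(vals)), rng)
            q, clock = monitor_step(q, clock, pos)
        wins += q == "accept"
    return wins / trials
```

Calling `success_rate(train())` gives an empirical estimate of how often the greedy policy meets the deadline under the assumed slip probability. Including the clock value in the product state is what lets the learned policy depend on the remaining time; a monitor derived from a general MITL formula would carry more automaton states and clocks than this single-deadline example.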