Learning Minimally-Violating Continuous Control for Infeasible Linear Temporal Logic Specifications

This paper explores continuous-time control synthesis for target-driven navigation that satisfies complex high-level tasks expressed in linear temporal logic (LTL). We propose a model-free framework using deep reinforcement learning (DRL) in which the underlying dynamic system is unknown (an opaque box). Unlike prior work, this paper considers scenarios where the given LTL specification might be infeasible and therefore cannot be accomplished globally. Instead of modifying the given LTL formula, we provide a general DRL-based approach that satisfies it with minimal violation. To do this, we transform a previously multi-objective DRL problem, which requires simultaneous automata satisfaction and minimum violation cost, into a single-objective one. By guiding the DRL agent with a sampling-based path planning algorithm for the potentially infeasible LTL task, the proposed approach mitigates the myopic tendencies of DRL, which are often an issue when learning general LTL tasks that can have long or infinite horizons. This is achieved by decomposing an infeasible LTL formula into several reach-avoid sub-tasks with shorter horizons, which can be trained in a modular DRL architecture. Furthermore, we overcome the exploration challenge that DRL faces in complex and cluttered environments by using path planners to design rewards that are dense in the configuration space. The benefits of the presented approach are demonstrated through tests on various complex nonlinear systems and comparisons with state-of-the-art baselines. A video demonstration is available at https://youtu.be/jBhx6Nv224E.
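To make the idea of planner-derived dense rewards concrete, the following is a minimal sketch, not the reward used in the paper. It assumes a 2D geometric path (a list of waypoints, such as a sampling-based planner might return for one reach-avoid sub-task) and rewards the agent for reducing its remaining distance to the sub-task goal measured along that path. The names `remaining_distance` and `progress_reward` are hypothetical, and obstacle penalties, automaton states, and violation costs are deliberately omitted.

```python
import math

def remaining_distance(path, pos):
    """Distance-to-go for `pos`: offset to the closest path segment,
    plus the path length from the projection point to the final waypoint.
    Assumes `path` holds at least two 2D waypoints."""
    # to_goal[i] = path length from waypoint i to the last waypoint
    to_goal = [0.0] * len(path)
    for i in range(len(path) - 2, -1, -1):
        to_goal[i] = to_goal[i + 1] + math.dist(path[i], path[i + 1])

    candidates = []
    for i in range(len(path) - 1):
        (ax, ay), (bx, by) = path[i], path[i + 1]
        dx, dy = bx - ax, by - ay
        seg_len_sq = dx * dx + dy * dy
        if seg_len_sq == 0.0:
            t = 0.0  # degenerate segment: treat it as a single point
        else:
            # projection parameter of pos onto the segment, clamped to [0, 1]
            t = ((pos[0] - ax) * dx + (pos[1] - ay) * dy) / seg_len_sq
            t = max(0.0, min(1.0, t))
        proj = (ax + t * dx, ay + t * dy)
        offset = math.dist(pos, proj)
        along = (1.0 - t) * math.sqrt(seg_len_sq) + to_goal[i + 1]
        candidates.append((offset, along))

    # closest segment wins; ties are broken in favour of more progress
    offset, along = min(candidates)
    return offset + along

def progress_reward(path, prev_pos, pos):
    """Positive when the step reduced the distance-to-go, negative otherwise."""
    return remaining_distance(path, prev_pos) - remaining_distance(path, pos)
```

Because the reward is defined at every position rather than only at the goal region, the agent receives a learning signal on each step. In this sketch, a step from `(0, 0)` to `(0.5, 0)` along the path `[(0, 0), (1, 0), (1, 1)]` yields a reward of `0.5`.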
