Learning Minimally-Violating Continuous Control for Infeasible Linear Temporal Logic Specifications

This paper explores continuous-time control synthesis for target-driven navigation that satisfies complex high-level tasks expressed in linear temporal logic (LTL). We propose a model-free framework based on deep reinforcement learning (DRL) in which the underlying dynamical system is unknown (an opaque box). Unlike prior work, we consider scenarios where the given LTL specification may be infeasible and therefore cannot be accomplished globally. Instead of modifying the given LTL formula, we provide a general DRL-based approach that satisfies it with minimal violation. To do this, we transform what was previously a multi-objective DRL problem, requiring simultaneous automaton satisfaction and minimum violation cost, into a single-objective one. By guiding the DRL agent with a sampling-based path planning algorithm for the potentially infeasible LTL task, the proposed approach mitigates the myopic tendencies of DRL, which are often a problem when learning general LTL tasks with long or infinite horizons. This is achieved by decomposing an infeasible LTL formula into several reach-avoid sub-tasks with shorter horizons, which can be trained in a modular DRL architecture. Furthermore, we overcome the exploration challenge that DRL faces in complex and cluttered environments by using path planners to design rewards that are dense in the configuration space. The benefits of the approach are demonstrated on various complex nonlinear systems and in comparisons with state-of-the-art baselines. A video demonstration is available at https://youtu.be/jBhx6Nv224E.
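To make the idea of planner-derived dense rewards concrete, the following is a minimal sketch and not the paper's actual reward design. It assumes a 2D workspace and a geometric waypoint path already returned by some sampling-based planner for one reach-avoid sub-task. The names `path_progress` and `dense_reward`, as well as the `goal_radius` and `goal_bonus` parameters, are hypothetical and chosen only for illustration. The reward at each step is the change in arc-length progress along the path, so the agent receives feedback everywhere along the path rather than only at the goal.

```python
import math


def path_progress(path, point):
    """Arc length along `path` of the path point closest to `point`.

    `path` is a list of (x, y) waypoints; this stands in for the output
    of a sampling-based planner (an assumption of this sketch).
    """
    best_dist, best_progress, travelled = math.inf, 0.0, 0.0
    for (ax, ay), (bx, by) in zip(path, path[1:]):
        seg_len = math.hypot(bx - ax, by - ay)
        if seg_len == 0.0:
            continue  # skip duplicated waypoints
        # Project `point` onto the segment and clamp to its endpoints.
        t = ((point[0] - ax) * (bx - ax) + (point[1] - ay) * (by - ay)) / seg_len**2
        t = max(0.0, min(1.0, t))
        px, py = ax + t * (bx - ax), ay + t * (by - ay)
        dist = math.hypot(point[0] - px, point[1] - py)
        if dist < best_dist:
            best_dist, best_progress = dist, travelled + t * seg_len
        travelled += seg_len
    return best_progress


def dense_reward(path, prev_pos, pos, goal_radius=0.1, goal_bonus=10.0):
    """Hypothetical shaped reward: progress made along the path this step,
    plus a bonus when the final waypoint of the sub-task is reached."""
    reward = path_progress(path, pos) - path_progress(path, prev_pos)
    gx, gy = path[-1]
    if math.hypot(pos[0] - gx, pos[1] - gy) <= goal_radius:
        reward += goal_bonus
    return reward


if __name__ == "__main__":
    waypoints = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    print(dense_reward(waypoints, (0.0, 0.0), (0.5, 0.0)))  # forward motion
    print(dense_reward(waypoints, (0.5, 0.0), (0.2, 0.0)))  # backward motion
```

In this sketch, moving forward along the path yields a positive reward and moving backward yields a negative one. Obstacle avoidance and the violation cost of an infeasible specification are deliberately left out; they would need additional terms.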
