This paper presents a hierarchical reinforcement learning algorithm constrained by differentiable signal temporal logic. Previous work on logic-constrained reinforcement learning encodes these constraints in a reward function and constrains policy updates with sample-based policy gradients. However, such techniques tend to be inefficient because of the large number of samples required to obtain accurate policy gradients. In this paper, instead of implicitly constraining policy search with sample-based policy gradients, we directly constrain policy search by backpropagating through formal constraints, which enables training hierarchical policies with substantially fewer training samples. The use of hierarchical policies is recognized as a crucial component of reinforcement learning with task constraints. We show that we can stably constrain policy updates, thus enabling different levels of the policy to be learned simultaneously, which yields superior performance compared with training them separately. Experimental results on several simulated robots with high-dimensional dynamics and on a real-world differential-drive robot (TurtleBot3) demonstrate the effectiveness of our approach on five different types of task constraints. Demo videos, code, and models can be found at our project website: https://sites.google.com/view/dscrl
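To make the idea of backpropagating through a formal constraint concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes a toy one-dimensional single-integrator system, a single "eventually reach the goal" specification, and a log-sum-exp smoothing of the maximum in the robustness measure, which is one common way to make signal temporal logic differentiable. All function names, the dynamics, and the parameter values (`goal`, `eps`, `temp`, `lr`) are hypothetical choices made for illustration, and the chain rule is written out by hand where an automatic differentiation framework would normally be used.

```python
import math

def soft_max(values, temp=0.1):
    """Log-sum-exp smoothing of max(values); also returns its gradient (softmax weights)."""
    m = max(values)
    exps = [math.exp((v - m) / temp) for v in values]
    total = sum(exps)
    return m + temp * math.log(total), [e / total for e in exps]

def rollout(x0, actions):
    """Assumed toy dynamics (single integrator): x_{t+1} = x_t + u_t."""
    xs, x = [], x0
    for u in actions:
        x += u
        xs.append(x)
    return xs

def soft_robustness(actions, x0=0.0, goal=1.0, eps=0.2, temp=0.1):
    """Smoothed robustness of 'eventually |x - goal| < eps' and its gradient w.r.t. actions."""
    xs = rollout(x0, actions)
    margins = [eps - abs(x - goal) for x in xs]  # positive iff the predicate holds at step t
    rho, weights = soft_max(margins, temp)
    # Chain rule, step 1: d rho / d x_t = w_t * d margin_t / d x_t = -w_t * sign(x_t - goal).
    grad_x = [-w * math.copysign(1.0, x - goal) for w, x in zip(weights, xs)]
    # Step 2: x_t depends on every u_k with k <= t, so d rho / d u_k = sum over t >= k.
    grad_u, running = [], 0.0
    for g in reversed(grad_x):
        running += g
        grad_u.append(running)
    grad_u.reverse()
    return rho, grad_u

def true_robustness(actions, x0=0.0, goal=1.0, eps=0.2):
    """Exact (non-smooth) robustness: max over time of the predicate margin."""
    return max(eps - abs(x - goal) for x in rollout(x0, actions))

def optimize(horizon=5, lr=0.01, steps=300):
    """Gradient ascent on the smoothed robustness, starting from zero actions."""
    actions = [0.0] * horizon
    for _ in range(steps):
        _, grad_u = soft_robustness(actions)
        actions = [u + lr * g for u, g in zip(actions, grad_u)]
    return actions

if __name__ == "__main__":
    tuned = optimize()
    print("robustness before:", true_robustness([0.0] * 5))
    print("robustness after: ", true_robustness(tuned))
```

In this sketch the gradient of the constraint reaches the actions directly through the dynamics, so no sampled return estimates are needed to decide how to change them. A positive value of `true_robustness` means the specification is satisfied by the rolled-out trajectory. Replacing the action sequence with the output of a parameterized policy would extend the same chain rule by one more step.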