This paper addresses the problem of learning optimal control policies for systems with uncertain dynamics and high-level control objectives specified as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace structure and in the outcomes of control decisions, giving rise to an unknown Markov Decision Process (MDP). Existing reinforcement learning (RL) algorithms for LTL tasks typically rely on exploring a product MDP state space uniformly (e.g., with an $\epsilon$-greedy policy), compromising sample efficiency. This issue becomes more pronounced as the rewards get sparser and the MDP size or task complexity increases. In this paper, we propose an accelerated RL algorithm that can learn control policies significantly faster than competitive approaches. Its sample efficiency relies on a novel task-driven exploration strategy that biases exploration towards directions that may contribute to task satisfaction. We provide theoretical analysis and extensive comparative experiments demonstrating the sample efficiency of the proposed method. The benefit of our method becomes more evident as the task complexity or the MDP size increases.
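The contrast drawn above, between uniform $\epsilon$-greedy exploration and task-driven biased exploration, can be illustrated with a minimal sketch. This is not the paper's algorithm; the `bias_weights` vector is a hypothetical stand-in for whatever task-driven score ranks actions by their expected contribution to satisfying the LTL task:

```python
import random

def biased_epsilon_greedy(q_values, bias_weights, epsilon=0.2):
    """Action selection sketch: exploit the greedy action with
    probability 1 - epsilon; otherwise explore, but sample from a
    task-driven bias distribution instead of uniformly.

    q_values     -- estimated Q-values, one per action
    bias_weights -- hypothetical nonnegative scores favoring actions
                    believed to contribute to task satisfaction
    """
    actions = list(range(len(q_values)))
    if random.random() > epsilon:
        # greedy (exploitation) step
        return max(actions, key=lambda a: q_values[a])
    # biased exploration: mass proportional to the task-driven scores,
    # rather than the uniform 1/|A| of plain epsilon-greedy
    total = sum(bias_weights)
    probs = [w / total for w in bias_weights]
    return random.choices(actions, weights=probs, k=1)[0]
```

Setting all `bias_weights` equal recovers standard $\epsilon$-greedy, which makes the sketch a drop-in way to compare uniform against biased exploration in a tabular RL loop.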