The commonly used Reinforcement Learning (RL) model, the Markov Decision Process (MDP), rests on the premise that rewards depend only on the current state and action. However, many real-world tasks are non-Markovian, involving long-term memory and dependencies. The reward sparsity problem is further amplified in non-Markovian scenarios; hence, learning a non-Markovian task (NMT) is inherently more difficult than learning a Markovian one. In this paper, we propose ParMod, a novel \textbf{Par}allel and \textbf{Mod}ular RL framework specifically for learning NMTs specified by temporal logic. With the aid of formal techniques, the NMT is modularized into a series of sub-tasks based on the structure of the automaton equivalent to its temporal logic specification. On this basis, the sub-tasks are trained by a group of agents in parallel, with one agent handling one sub-task. Beyond parallel training, the core of ParMod lies in two components: a flexible classification method for modularizing the NMT, and an effective reward shaping method for improving sample efficiency. A comprehensive evaluation is conducted on several challenging benchmark problems with respect to various metrics. The experimental results show that ParMod achieves superior performance over other relevant approaches. Our work thus provides a good synergy among RL, NMTs and temporal logic.