Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state 'task automata' from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP (both initially unknown), by treating the product MDP as a partially observable MDP and using the well-known Baum-Welch algorithm for learning hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well-suited for use in transfer learning. Finally, we report experimental comparisons against two baselines to illustrate our algorithm's performance across different environments and tasks.
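To make the notion of a product MDP concrete, the following is a minimal illustrative sketch, not the paper's implementation. It pairs a small hand-written task DFA with a toy deterministic environment and steps both in lockstep. Everything named here (`TaskDFA`, `product_step`, the corridor environment, the `key` and `door` labels, and the reward rule) is a hypothetical choice made for illustration; in the paper's setting both the automaton and the environment are unknown and must be learnt from experience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDFA:
    """Deterministic finite automaton over high-level labels (hypothetical example)."""
    initial: str
    accepting: frozenset
    delta: dict  # maps (automaton_state, label) -> next automaton_state

    def step(self, q, label):
        # Labels without an explicit transition leave the automaton state unchanged.
        return self.delta.get((q, label), q)


def product_step(dfa, env_step, labelling, state, action):
    """One transition of the product model on pairs (env_state, automaton_state)."""
    s, q = state
    s_next = env_step(s, action)
    q_next = dfa.step(q, labelling(s_next))
    # Assumed reward rule: 1 only when an accepting automaton state is first entered.
    reward = 1.0 if q_next in dfa.accepting and q not in dfa.accepting else 0.0
    return (s_next, q_next), reward


# Toy task: "visit the key cell, then the door cell" on a corridor of 5 cells.
dfa = TaskDFA(
    initial="q0",
    accepting=frozenset({"q2"}),
    delta={("q0", "key"): "q1", ("q1", "door"): "q2"},
)


def env_step(s, action):
    # Deterministic corridor dynamics; action is -1 (left) or +1 (right).
    return max(0, min(4, s + action))


def labelling(s):
    # Cell 0 holds the key and cell 4 holds the door; other cells are unlabelled.
    return {0: "key", 4: "door"}.get(s, "none")


def run(start, actions):
    """Roll out a sequence of actions from the initial product state."""
    state, total = (start, dfa.initial), 0.0
    for a in actions:
        state, r = product_step(dfa, env_step, labelling, state, a)
        total += r
    return state, total
```

In this sketch, reaching the door cell yields reward only if the key cell was visited earlier, so the reward depends on history when viewed from the environment state alone, yet it is Markovian over the product state. An agent that observes only the environment state sees the automaton component as hidden, which is the partially observable view the abstract exploits.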