Reinforcement learning (RL) is a powerful tool for finding optimal policies in sequential decision processes. However, deep RL methods suffer from two weaknesses: collecting the amount of agent experience required for practical RL problems is prohibitively expensive, and the learned policies generalize poorly to tasks outside the training distribution. To mitigate these issues, we introduce automaton distillation, a form of neuro-symbolic transfer learning in which Q-value estimates from a teacher are distilled into a low-dimensional representation in the form of an automaton. We then propose two methods for generating Q-value estimates: static transfer, which reasons over an abstract Markov Decision Process constructed from prior knowledge, and dynamic transfer, which extracts symbolic information from a teacher Deep Q-Network (DQN). The Q-value estimates from either method are used to bootstrap learning in the target environment via a modified DQN loss function. We list several failure modes of existing automaton-based transfer methods and demonstrate that both static and dynamic automaton distillation decrease the time required to find optimal policies for various decision tasks.