Lifelong learning aims to create AI systems that learn continuously and incrementally over a lifetime, in a manner similar to biological learning. Attempts so far have encountered problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable for dealing with such a large spectrum of task variations. In this paper, we adapt modulating masks to work with deep LRL, specifically with PPO and IMPALA agents. In comparisons with LRL baselines on both discrete and continuous RL tasks, the approach shows superior performance. We further investigate the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, but the algorithm also solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge for learning increasingly complex tasks, and to knowledge reuse for efficient and faster learning.
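To make the idea of combining previously learned masks concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes a single linear layer whose weights stay fixed across tasks, per-task real-valued scores that are thresholded into binary masks, and softmax-normalized combination coefficients. The names `binarize`, `softmax`, `combine_scores`, and `masked_forward` are hypothetical, and the thresholding and normalization choices are assumptions made for illustration only.

```python
import math


def binarize(scores, threshold=0.0):
    """Turn real-valued scores into a 0/1 mask (1 where score > threshold).

    Assumption: masks are obtained by thresholding per-task scores.
    """
    return [[1.0 if s > threshold else 0.0 for s in row] for row in scores]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_scores(score_list, betas):
    """Linear combination of score matrices.

    Assumption: the coefficients are softmax-normalized so they sum to 1.
    """
    coeffs = softmax(betas)
    rows, cols = len(score_list[0]), len(score_list[0][0])
    return [
        [sum(c * s[r][k] for c, s in zip(coeffs, score_list)) for k in range(cols)]
        for r in range(rows)
    ]


def masked_forward(weights, mask, x):
    """Compute y = (W * mask) x for one linear layer with fixed weights W."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


# Fixed weights shared by all tasks (illustrative values).
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Scores learned for two earlier tasks (illustrative values).
scores_task1 = [[0.5, -0.2, 0.1], [-0.3, 0.4, -0.1]]
scores_task2 = [[-0.4, 0.6, 0.2], [0.1, -0.5, 0.3]]

# A new task starts from an equally weighted combination of the old scores.
combined = combine_scores([scores_task1, scores_task2], betas=[0.0, 0.0])
new_mask = binarize(combined)
output = masked_forward(W, new_mask, [1.0, 1.0, 1.0])
```

In a trainable version of this sketch, the coefficients `betas` and the new task's own scores would be optimized by the RL objective while `W` and the earlier scores remain frozen, which is one way the reuse of previous knowledge described above could be realized.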