Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Because the continual-learning problem space is complex, addressing it requires a combination of methods. This problem space includes: (1) mitigating catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) enabling learning without task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, in which an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. The agent therefore acts in a self-supervised manner, systematically seeking out novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: https://github.com/wabbajack1/TAPD.
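The two ingredients named above can be illustrated with a minimal, hypothetical sketch: a count-based novelty bonus standing in for the intrinsic-motivation signal (TAPD's actual reward may differ), and distillation of a teacher policy into a student by minimizing KL(teacher || student). All names here are illustrative, not the authors' implementation.

```python
import numpy as np

def intrinsic_reward(visit_counts, state):
    """Novelty bonus that decays as a state is revisited (stand-in for
    the intrinsic motivation maximized in the task-agnostic phase)."""
    return 1.0 / np.sqrt(visit_counts.get(state, 0) + 1)

def distill(teacher_probs, n_steps=500, lr=0.5):
    """Fit student logits so that softmax(logits) matches the teacher
    policy, by gradient descent on KL(teacher || student)."""
    logits = np.zeros_like(teacher_probs)
    for _ in range(n_steps):
        student = np.exp(logits - logits.max())
        student /= student.sum()
        # gradient of KL(teacher || student) w.r.t. the student's logits
        logits -= lr * (student - teacher_probs)
    return logits

# Novel states yield a large bonus; familiar ones a small bonus.
counts = {}
r_novel = intrinsic_reward(counts, "s0")      # unvisited state
counts["s0"] = 25
r_familiar = intrinsic_reward(counts, "s0")   # heavily visited state

# Distill a fixed teacher policy over three actions into a student.
teacher = np.array([0.7, 0.2, 0.1])
logits = distill(teacher)
student = np.exp(logits - logits.max())
student /= student.sum()
```

The decaying bonus drives the self-supervised search for novel states, while the distillation step is how knowledge from the exploratory teacher is compressed for reuse on downstream tasks.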