Motivated by the efficiency and rapid convergence of pre-trained models on downstream tasks, this paper extensively studies the impact of Continual Learning (CL) models as pre-trainers. In both supervised and unsupervised CL, we find that the transfer quality of the representation often increases gradually without noticeable degradation in fine-tuning performance. This is because CL models can learn improved task-generic features while readily forgetting task-specific knowledge. Based on this observation, we suggest a new unsupervised CL framework with masked modeling, which aims to capture fluent task-generic representations during training. Furthermore, we propose a new fine-tuning scheme, GLobal Attention Discretization (GLAD), that preserves rich task-generic representations while solving downstream tasks. A model fine-tuned with GLAD achieves competitive performance and can itself serve as a good pre-trained model. We believe this paper breaks the barrier between the pre-training and fine-tuning steps and leads to a sustainable learning framework in which the continual learner incrementally improves model generalization, yielding better transfer to unseen tasks.