Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, direct fine-tuning of the pre-trained model may be suboptimal when the data domains for pre-training and fine-tuning differ substantially. To tackle this issue, several previous studies have proposed further pre-training strategies, where we continue to pre-train the model on the target unlabeled dataset before fine-tuning. However, all of them focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting when we continue to pre-train it on target unlabeled data. To address this limitation, we propose self-distillation as a regularizer for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then treat it as the teacher for self-distillation. We then take the same initial pre-trained model as the student and constrain its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks. Experimentally, we show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially help improve performance on downstream tasks.
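To make the student's training signal concrete, the following is a minimal sketch of one plausible form of the combined objective: a masked auto-encoding term plus a penalty on the distance between the student's and the teacher's hidden representations. The function names, the squared-error distance, and the `distill_weight` hyperparameter are illustrative assumptions; the abstract does not specify the exact distance measure or weighting.

```python
from typing import Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of squared element-wise differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(
    reconstruction: Sequence[float],
    target: Sequence[float],
    mask: Sequence[bool],
    student_hidden: Sequence[float],
    teacher_hidden: Sequence[float],
    distill_weight: float = 1.0,  # hypothetical hyperparameter, not from the source
) -> float:
    """Masked auto-encoding loss plus a student-teacher representation penalty."""
    # Masked auto-encoding term: reconstruction error on masked positions only.
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        raise ValueError("at least one position must be masked")
    mae_term = sum((reconstruction[i] - target[i]) ** 2 for i in masked) / len(masked)

    # Distillation term (assumed squared error): keeps the student's hidden
    # representation close to that of the further pre-trained teacher.
    distill_term = mean_squared_error(student_hidden, teacher_hidden)

    return mae_term + distill_weight * distill_term


loss = self_distillation_loss(
    reconstruction=[1.0, 2.0, 3.0, 4.0],
    target=[1.0, 0.0, 3.0, 2.0],
    mask=[False, True, False, True],
    student_hidden=[1.0, 1.0],
    teacher_hidden=[0.0, 1.0],
    distill_weight=2.0,
)
print(loss)
```

In this sketch the teacher's hidden representation is passed in as a plain value, reflecting the assumption that the teacher is held fixed while only the student is optimized.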