Pre-training a large transformer model on massive unlabeled data and then fine-tuning it on labeled datasets for diverse downstream tasks has proven successful across a variety of vision and natural language processing tasks. However, direct fine-tuning of the pre-trained model may be suboptimal when the pre-training and fine-tuning data domains differ substantially. To address this issue, several previous studies have proposed further pre-training strategies, in which the model continues to be pre-trained on the target unlabeled dataset before fine-tuning. However, all of them focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting when we continue to pre-train it on target unlabeled data. To overcome this limitation, we propose self-distillation as a regularizer for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and treat the result as the teacher for self-distillation. We then take the same initial pre-trained model as the student and encourage its hidden representations to stay close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification, where the proposed method outperforms all relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially improve performance on downstream tasks.
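As a rough illustration of the student objective described above, the following minimal sketch combines a masked reconstruction term with a distillation term that pulls the student's hidden representation toward the teacher's. It is not the authors' implementation. The toy linear `encode`/`decode` functions stand in for a Transformer encoder and decoder. The names `self_distillation_loss` and `lam`, the use of mean squared error for both terms, zero-filling of masked positions, and feeding the same masked input to the teacher are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode(weights, x):
    """Toy linear encoder (hypothetical stand-in for a Transformer encoder)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def decode(weights, h):
    """Toy linear decoder mapping hidden features back to input space."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in weights]


def self_distillation_loss(student_enc, student_dec, teacher_enc, x, mask, lam=1.0):
    """Return (total, reconstruction, distillation) losses for one example.

    mask[i] is True where input position i is hidden from the encoder.
    The teacher weights are only read, never updated (assumed frozen).
    """
    masked_idx = [i for i, m in enumerate(mask) if m]
    if not masked_idx:
        raise ValueError("at least one position must be masked")

    # Assumption: masked positions are replaced by zeros.
    x_masked = [0.0 if m else xi for xi, m in zip(x, mask)]

    h_student = encode(student_enc, x_masked)
    h_teacher = encode(teacher_enc, x_masked)

    # Masked auto-encoding term: reconstruct only the masked positions.
    recon = decode(student_dec, h_student)
    recon_loss = mse([recon[i] for i in masked_idx], [x[i] for i in masked_idx])

    # Distillation term: keep student hidden features close to the teacher's.
    distill_loss = mse(h_student, h_teacher)

    return recon_loss + lam * distill_loss, recon_loss, distill_loss
```

In this sketch, a student that starts from the same weights as the teacher has a distillation term of zero, and `lam` controls how strongly the student is held near the teacher as the reconstruction term is optimized.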