Self-Distillation for Further Pre-training of Transformers

Pre-training a large transformer model on a massive amount of unlabeled data and then fine-tuning it on labeled datasets has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, directly fine-tuning the pre-trained model may be suboptimal if there are large discrepancies between the pre-training and fine-tuning data domains. To tackle this issue, several previous studies have proposed further pre-training strategies, in which the model continues to be pre-trained on the target unlabeled dataset before fine-tuning. However, all of them focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting as we continue to pre-train it on target unlabeled data. To address this limitation, we propose self-distillation as a regularizer for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and treat the result as a teacher for self-distillation. We then take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification, where the proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially help improve performance on downstream tasks.
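
The student objective described above (a masked auto-encoding loss plus a penalty that keeps the student's hidden representations close to the teacher's) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the distance measure, the weighting, or which layers are matched, so the squared-error distance, the `distill_weight` parameter, and all function names here are hypothetical choices, not the authors' implementation.

```python
import numpy as np


def masked_reconstruction_loss(reconstruction, target, mask):
    """Mean squared error averaged over masked positions only (mask == 1)."""
    mask = mask.astype(float)
    # Per-position error, averaged over the feature dimension.
    per_position = ((reconstruction - target) ** 2).mean(axis=-1)
    # Guard against division by zero when nothing is masked.
    return float((per_position * mask).sum() / max(mask.sum(), 1.0))


def distillation_loss(student_hidden, teacher_hidden):
    """Assumed distance: mean squared difference between hidden representations."""
    return float(((student_hidden - teacher_hidden) ** 2).mean())


def student_objective(reconstruction, target, mask,
                      student_hidden, teacher_hidden, distill_weight=1.0):
    """Masked auto-encoding loss plus a weighted self-distillation penalty.

    `distill_weight` is a hypothetical hyperparameter for this sketch.
    """
    return (masked_reconstruction_loss(reconstruction, target, mask)
            + distill_weight * distillation_loss(student_hidden, teacher_hidden))


# Toy example: 3 tokens with 2 features each; tokens 0 and 2 are masked.
target = np.zeros((3, 2))
reconstruction = target + 1.0      # every reconstructed value is off by 1
mask = np.array([1, 0, 1])
teacher_hidden = np.zeros((3, 4))  # stands in for the further pre-trained teacher
student_hidden = np.ones((3, 4))   # stands in for the student's representations

loss = student_objective(reconstruction, target, mask,
                         student_hidden, teacher_hidden, distill_weight=0.5)
print(loss)  # 1.5
```

In an actual training loop the teacher's representations would be computed once from the further pre-trained model and used as fixed targets, while gradients flow only through the student; that detail is likewise an assumption of this sketch.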

