Self-Distillation for Further Pre-training of Transformers

Pre-training a large transformer model on a massive amount of unlabeled data and then fine-tuning it on labeled datasets has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, directly fine-tuning the pre-trained model may be suboptimal when there are large discrepancies between the data domains used for pre-training and fine-tuning. To tackle this issue, several previous studies have proposed further pre-training strategies, in which the model continues to be pre-trained on the target unlabeled dataset before fine-tuning. However, all of these studies focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting when we continue to pre-train it on target unlabeled data. To address this limitation, we propose self-distillation as a regularizer for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and use the result as a teacher for self-distillation. We then take the same initial pre-trained model as a student and constrain its hidden representations to stay close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification, where the proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can help improve performance on downstream tasks.
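The student objective described above combines a masked auto-encoding loss with a term that keeps the student's hidden representations close to the teacher's. The following is a minimal sketch of that combination on plain Python lists, not the authors' implementation. The abstract does not specify the distance measure or how the two terms are weighted, so the squared-error distance, the `distill_weight` parameter, and all function names here are assumptions made for illustration.

```python
from typing import Sequence


def masked_reconstruction_loss(
    pred: Sequence[float], target: Sequence[float], mask: Sequence[int]
) -> float:
    """Mean squared error over masked positions only (mask[i] == 1)."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        return 0.0
    return sum(errors) / len(errors)


def distillation_loss(
    student_hidden: Sequence[float], teacher_hidden: Sequence[float]
) -> float:
    """Mean squared distance between student and teacher hidden features.

    Assumption: squared error is used as the closeness measure.
    """
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("hidden representations must have the same length")
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)


def self_distillation_objective(
    pred: Sequence[float],
    target: Sequence[float],
    mask: Sequence[int],
    student_hidden: Sequence[float],
    teacher_hidden: Sequence[float],
    distill_weight: float = 1.0,  # hypothetical weighting, not from the source
) -> float:
    """Masked auto-encoding loss plus a weighted distillation regularizer."""
    recon = masked_reconstruction_loss(pred, target, mask)
    distill = distillation_loss(student_hidden, teacher_hidden)
    return recon + distill_weight * distill


# Toy values: two of three positions are masked, two hidden features.
loss = self_distillation_objective(
    pred=[1.0, 2.0, 3.0],
    target=[1.0, 0.0, 1.0],
    mask=[0, 1, 1],
    student_hidden=[0.0, 1.0],
    teacher_hidden=[1.0, 1.0],
)
print(loss)  # 4.0 (reconstruction) + 1.0 * 0.5 (distillation) = 4.5
```

In an actual training loop the teacher's hidden representations would be computed with the teacher's weights frozen, so that only the student is updated by this objective.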

