The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task, and it has become commonplace across many areas of machine learning. While pretraining is empirically observed to be beneficial for a range of tasks, the reasons for this effect are not yet clearly understood. In this work, we examine the relationship between pretrained vision transformers and their finetuned counterparts on several benchmark datasets and tasks. We present new metrics that specifically measure the degree to which invariances learned by a pretrained model are retained or forgotten during finetuning. Using these metrics, we present a suite of empirical findings, including that pretraining induces transferable invariances in shallow layers and that invariances from deeper pretrained layers are compressed towards shallower layers during finetuning. Together, these findings contribute to understanding some of the reasons for the success of pretrained models and the changes that a pretrained model undergoes when finetuned on a downstream task.