Pre-training has been a popular learning paradigm in the deep learning era, especially in annotation-insufficient scenarios. Previous research has demonstrated, from the perspective of architecture, that better ImageNet pre-trained models transfer better to downstream tasks. However, in this paper, we find that during the same pre-training process, models at middle epochs, which are inadequately pre-trained, can outperform fully trained models when used as feature extractors (FE), while the fine-tuning (FT) performance still grows with the source performance. This reveals that there is no solid positive correlation between top-1 accuracy on ImageNet and the transfer result on target data. Motivated by this contradiction between FE and FT, namely that a better feature extractor is not fine-tuned correspondingly better, we conduct comprehensive analyses of the features before the softmax layer to provide insightful explanations. Our findings suggest that, during pre-training, models tend to first learn the spectral components corresponding to large singular values, and that the residual components contribute more during fine-tuning.
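To make the terms "spectral components" and "residual components" concrete, the following is a minimal sketch, assuming the decomposition is a plain singular value decomposition of the matrix of pre-softmax features. The abstract does not specify the exact procedure, so the function name `split_spectral`, the cutoff `k`, and the synthetic feature matrix are all hypothetical and serve only as illustration.

```python
import numpy as np

def split_spectral(features, k):
    """Split a feature matrix into its top-k spectral part and the residual.

    features: (n_samples, n_dims) array, a stand-in for pre-softmax features.
    k: number of leading singular values to keep (hypothetical cutoff).
    """
    # Thin SVD: features = U @ diag(S) @ Vt, with S sorted in descending order.
    U, S, Vt = np.linalg.svd(features, full_matrices=False)
    # Components associated with the k largest singular values.
    top = (U[:, :k] * S[:k]) @ Vt[:k]
    # Everything else: the residual components.
    residual = features - top
    return top, residual, S

# Synthetic data only; real features would come from a pre-trained model.
rng = np.random.default_rng(0)
features = rng.normal(size=(32, 8))
top, residual, S = split_spectral(features, k=2)
```

Under this assumption, `top` is the best rank-`k` approximation of the features in the Frobenius norm, and the energy left in `residual` equals the sum of the squared singular values beyond the first `k`.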