The transfer learning paradigm of model pre-training followed by fine-tuning produces high-accuracy models. While most studies recommend scaling up the pre-training dataset to benefit most from transfer learning, a question remains: what data and method should be used for pre-training? We investigate the impact of the pre-training data distribution on few-shot and full fine-tuning performance using 3 pre-training methods (supervised, contrastive language-image, and contrastive image-image), 7 pre-training datasets, and 9 downstream datasets. Through extensive controlled experiments, we find that the choice of pre-training data source is essential for few-shot transfer, but its role diminishes as more data is made available for fine-tuning. Additionally, we explore the role of data curation and examine the trade-off between label noise and the size of the pre-training dataset. We find that using 2000× more pre-training data from LAION can match the performance of supervised ImageNet pre-training. Furthermore, we investigate the effect of the pre-training method, comparing language-image contrastive with image-image contrastive pre-training, and find that the latter leads to better downstream accuracy.