Fine-tuning large pre-trained image and language models on small customized datasets has become increasingly popular, both for improved prediction and for efficient use of limited resources. Fine-tuning requires identifying the best models to transfer-learn from, and quantifying transferability avoids expensive re-training on all candidate model/task pairs. In this paper, we show that statistical problems with covariance estimation drive the poor performance of H-score, a common baseline for newer metrics, and we propose a shrinkage-based estimator. This results in up to an 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure. Our shrinkage-based H-score is $3\times$ to $10\times$ faster to compute than LogME. Additionally, we look into the less common setting of target (as opposed to source) task selection. For some recent metrics, e.g., NCE and LEEP, we demonstrate previously overlooked problems in such settings when target tasks differ in number of labels, class-imbalance ratio, etc.; these problems resulted in the metrics being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We support our findings with approximately 164,000 fine-tuning trials on both vision models and graph neural networks.
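To make the covariance issue concrete, the following is a minimal sketch of an H-score computation with a regularized feature covariance. It assumes the commonly used definition of H-score, $\mathrm{tr}\big(\mathrm{cov}(f)^{-1}\,\mathrm{cov}(\mathbb{E}[f \mid y])\big)$, and it substitutes a simple fixed-intensity linear shrinkage toward a scaled identity matrix. The abstract does not specify the estimator, so the function name `h_score`, the parameter `alpha`, and the shrinkage target are all illustrative assumptions; the estimator proposed in the paper may differ (for example, by choosing the shrinkage intensity from the data).

```python
import numpy as np


def h_score(features, labels, alpha=0.0):
    """Illustrative H-score with optional linear shrinkage (hypothetical helper).

    features: (n, d) array of extracted features.
    labels:   (n,) array of class labels.
    alpha:    assumed fixed shrinkage intensity in [0, 1]; 0 means no shrinkage.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n, d = features.shape

    # Center the features and form the sample covariance.
    centered = features - features.mean(axis=0)
    cov_f = centered.T @ centered / n

    # Assumed shrinkage target: identity scaled by the average variance.
    target = (np.trace(cov_f) / d) * np.eye(d)
    cov_shrunk = (1.0 - alpha) * cov_f + alpha * target

    # Replace every sample by the mean of its class (on centered features),
    # then take the covariance of these class-conditional means.
    class_means = np.zeros_like(centered)
    for c in np.unique(labels):
        mask = labels == c
        class_means[mask] = centered[mask].mean(axis=0)
    cov_g = class_means.T @ class_means / n

    # The pseudo-inverse keeps the unshrunk case defined when cov_f is singular.
    return float(np.trace(np.linalg.pinv(cov_shrunk) @ cov_g))
```

When the feature dimension is comparable to or larger than the number of samples, the sample covariance is singular or badly conditioned, so its inverse is unstable. Any `alpha > 0` in this sketch makes the regularized matrix positive definite, which is the general motivation for shrinkage estimators.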