In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) to target tasks before fine-tuning. We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates from their extracted representations. By adopting a temporal independence hypothesis, our framework computes transferability scores efficiently, without actually fine-tuning the candidate models or layers. We evaluate popular supervised speech models (e.g., Conformer RNN-Transducer) and self-supervised speech models (e.g., HuBERT) in cross-layer and cross-model settings using public data. Experimental results show a high Spearman's rank correlation, with a low $p$-value, between the scores from our estimation framework and the fine-tuning ground truth. Because it avoids fine-tuning every candidate, the proposed framework requires less computational time and fewer resources, making it a resource-saving and time-efficient approach for tuning speech foundation models.
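As an illustration of the evaluation protocol described above, the following minimal sketch computes Spearman's rank correlation between a set of transferability scores and the corresponding fine-tuned results. The function names (`average_ranks`, `spearman`) and all numeric values are hypothetical and chosen only for illustration; they are not results or code from this work, and the sketch does not compute a $p$-value.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for five candidate models or layers (illustration only).
transfer_scores = [0.12, 0.45, 0.31, 0.78, 0.60]     # estimated without fine-tuning
finetuned_accuracy = [61.0, 70.5, 72.0, 88.1, 79.4]  # assumed ground truth after fine-tuning

rho = spearman(transfer_scores, finetuned_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho close to 1 would indicate that ranking candidates by transferability score closely matches ranking them by fine-tuned performance, which is the property the framework is evaluated on.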