In Self-Supervised Learning (SSL), pre-training and evaluation are resource-intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost-efficient manner during pre-training. In this work, we propose efficient unsupervised methods that give insights into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and the rank of the embeddings of the SSL model. Results show that measures of cluster quality and rank correlate better with downstream performance than the pre-training loss, using only one hour of unlabeled audio, reducing the need for GPU hours and labeled data in SSL model evaluation.
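The abstract does not pin down the exact metrics, so the following is only a minimal sketch of the two kinds of measures it names, under stated assumptions: cluster quality is scored here with a k-means silhouette score, and rank with a RankMe-style effective rank (the exponential of the entropy of the normalized singular values of the embedding matrix). The embedding matrix `Z`, the number of clusters, and the frame count are illustrative placeholders, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def effective_rank(embeddings: np.ndarray, eps: float = 1e-12) -> float:
    """RankMe-style effective rank: exp of the entropy of the
    normalized singular-value distribution of the embedding matrix.
    (Assumed proxy for the 'rank' measure; not confirmed by the abstract.)"""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / (s.sum() + eps) + eps
    return float(np.exp(-(p * np.log(p)).sum()))


def cluster_quality(embeddings: np.ndarray, n_clusters: int = 50, seed: int = 0) -> float:
    """Silhouette score of k-means clusters over frame embeddings;
    higher means better-separated clusters. (Assumed proxy metric.)"""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    # Subsample for speed; silhouette is quadratic in the sample size.
    return float(silhouette_score(embeddings, labels, sample_size=5000, random_state=seed))


# Z stands in for frame-level embeddings from one layer of an SSL model,
# extracted from roughly one hour of unlabeled audio (shape: frames x dim).
Z = np.random.randn(20000, 768).astype(np.float32)  # placeholder data
print(f"effective rank: {effective_rank(Z):.1f}")
print(f"silhouette:     {cluster_quality(Z):.3f}")
```

In this setup, both numbers can be tracked across pre-training checkpoints and correlated with downstream scores, in place of the loss curve.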