Self-supervised learning (SSL) of high-level speech representations has become a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings. However, the literature commonly assumes that a considerable amount of unlabeled data from the same domain or language is available for SSL pre-training, which we acknowledge is not feasible in a real-world setting. In this paper, as part of the Interspeech Gram Vaani ASR challenge, we study how the domain, language, dataset size, and other aspects of the upstream SSL pre-training data affect the final performance of the low-resource downstream ASR task. We also build on the continued pre-training paradigm to study the effect of the prior knowledge possessed by models trained using SSL. Extensive experiments and studies reveal that the performance of ASR systems is sensitive to the data used for SSL pre-training: performance improves as the pre-training data becomes more similar to the downstream data and as its volume increases. We believe our work will help the speech community build better ASR systems in low-resource settings and steer research towards improving generalization in SSL-based pre-training for speech systems.