Self-supervised learning (SSL) has advanced rapidly in recent years across many areas, including speech processing. Speech-based SSL models perform well on a range of speech-related tasks. However, training SSL models is computationally expensive, so a common practice is to fine-tune a released SSL model on the specific task. This requires the front-end input to be consistent between pre-training and fine-tuning, which becomes a problem when the optimal front-end for the task differs from the one used in pre-training. In this paper, we propose a simple but effective front-end adapter to address this front-end discrepancy. By minimizing the distance between the outputs of different front-ends, the adapter makes filterbank features (Fbank) compatible with SSL models that were pre-trained on waveforms. Experimental results demonstrate the effectiveness of the proposed front-end adapter on several popular SSL models for the speech recognition task.
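The idea of minimizing the distance between the outputs of two front-ends can be illustrated with a small sketch. The abstract does not specify the adapter architecture or the distance measure, so everything below is an assumption made for illustration: the adapter is a single linear layer, the distance is the mean squared L2 distance per frame, and the "waveform front-end outputs" are synthetic targets produced by a hidden linear map rather than by a real SSL model. All names (`apply_linear`, `train_adapter`, the dimensions) are hypothetical.

```python
import random

rng = random.Random(0)
FBANK_DIM, FRONTEND_DIM, N_FRAMES = 4, 3, 64  # toy sizes, chosen for illustration

# Synthetic stand-in for Fbank frames (assumption: not real audio features).
fbank = [[rng.gauss(0.0, 1.0) for _ in range(FBANK_DIM)] for _ in range(N_FRAMES)]

# Hidden linear map used only to fabricate frame-aligned targets that play the
# role of the frozen waveform front-end outputs (assumption).
hidden_w = [[rng.uniform(-1.0, 1.0) for _ in range(FBANK_DIM)]
            for _ in range(FRONTEND_DIM)]
hidden_b = [rng.uniform(-1.0, 1.0) for _ in range(FRONTEND_DIM)]


def apply_linear(weights, bias, frame):
    """Map one input frame through a linear layer: weights @ frame + bias."""
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, bias)]


target = [apply_linear(hidden_w, hidden_b, x) for x in fbank]


def train_adapter(inputs, targets, epochs=300, lr=0.1):
    """Fit a linear adapter by full-batch gradient descent on the mean
    squared L2 distance between adapter outputs and target outputs."""
    in_dim, out_dim, n = len(inputs[0]), len(targets[0]), len(inputs)
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    bias = [0.0] * out_dim
    losses = []
    for _ in range(epochs):
        grad_w = [[0.0] * in_dim for _ in range(out_dim)]
        grad_b = [0.0] * out_dim
        loss = 0.0
        for x, y in zip(inputs, targets):
            pred = apply_linear(weights, bias, x)
            for i in range(out_dim):
                diff = pred[i] - y[i]
                loss += diff * diff / n
                scaled = 2.0 * diff / n  # derivative of the mean squared distance
                grad_b[i] += scaled
                for j in range(in_dim):
                    grad_w[i][j] += scaled * x[j]
        losses.append(loss)  # distance measured before this epoch's update
        for i in range(out_dim):
            bias[i] -= lr * grad_b[i]
            for j in range(in_dim):
                weights[i][j] -= lr * grad_w[i][j]
    return weights, bias, losses


weights, bias, losses = train_adapter(fbank, target)
print(f"distance before: {losses[0]:.4f}, after: {losses[-1]:.2e}")
```

In a realistic setting the targets would presumably come from the pre-trained model's own waveform front-end applied to the same utterances, and the adapter would likely be a small neural network rather than one linear layer; both points are assumptions beyond what the abstract states.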