Large representation models pre-trained with self-supervised learning have gained popularity in various fields of machine learning because they can extract high-quality salient features from input data. As such, they are frequently used as base networks for pattern classification tasks such as speech recognition. However, little research has been conducted on applying these models to speech signal generation. In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task. To alleviate the mismatch between the input features of the pre-trained model and those of the target enhancement model, we adopt a novel feature normalization technique that smoothly links these modules together. When combined with various types of pre-trained speech models, our proposed method significantly improves speech quality over baselines.