Benefiting from the development of deep learning, text-to-speech (TTS) techniques trained on clean speech have achieved significant performance improvements. Data collected in real-world conditions, however, often contain noise and generally need to be denoised by speech enhancement models. Noise-robust TTS models are therefore often trained on enhanced speech, which suffers from speech distortion and background noise that degrade the quality of the synthesized speech. Meanwhile, self-supervised pre-trained models have been shown to exhibit excellent noise robustness on many speech tasks, implying that their learned representations tolerate noise perturbations better. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which learns to map the representations of pre-trained models to waveforms. We then propose a text-to-representation FastSpeech 2 model, which learns to map text to pre-trained model representations. Experimental results on the LJSpeech and LibriTTS datasets show that our method outperforms approaches based on speech enhancement on both subjective and objective metrics. Audio samples are available at: https://zqs01.github.io/rep2wav/.