Benefiting from the development of deep learning, text-to-speech (TTS) techniques trained on clean speech have achieved significant performance improvements. However, data collected in real-world scenarios often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are therefore often trained on enhanced speech, which suffers from speech distortion and background noise that degrade the quality of the synthesized speech. Meanwhile, self-supervised pre-trained models have been shown to exhibit excellent noise robustness on many speech tasks, implying that their learned representations are more tolerant of noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. We first propose a representation-to-waveform vocoder based on HiFi-GAN, which learns to map pre-trained model representations to waveforms. We then propose a text-to-representation FastSpeech2 model, which learns to map text to pre-trained model representations. Experimental results on the LJSpeech and LibriTTS datasets show that our method outperforms approaches that rely on speech enhancement in both subjective and objective metrics. Audio samples are available at: https://zqs01.github.io/rep2wav.