Zero-shot text-to-speech (TTS) methods based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations can reproduce speaker characteristics very accurately. However, the quality of the synthesized speech degrades when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method. We incorporate adapters into the SSL model and fine-tune them together with the TTS model using noisy reference speech. To further improve performance, we also adopt a speech enhancement (SE) front-end. With these improvements, the proposed SSL-based zero-shot TTS achieves high-quality speech synthesis even with noisy reference speech. Objective and subjective evaluations confirm that the proposed method is highly robust to noise in the reference speech and works effectively in combination with SE.
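To make the idea of inserting adapters into an SSL model concrete, the following is a minimal sketch of one common adapter form: a residual bottleneck (down-projection, ReLU, up-projection, skip connection). The abstract does not specify the adapter design, so the structure, the zero-initialized up-projection, and the names `make_adapter`, `matvec`, and `adapter_forward` are all illustrative assumptions rather than the authors' implementation.

```python
import random


def make_adapter(dim, bottleneck, seed=0):
    """Create parameters for a hypothetical residual bottleneck adapter.

    Assumption: the up-projection starts at zero, so the adapter is an
    identity mapping before fine-tuning and leaves the SSL features intact.
    """
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
    up = [[0.0] * bottleneck for _ in range(dim)]
    return {"down": down, "up": up}


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter_forward(params, h):
    """Apply the adapter to one frame of hidden features h.

    Computes h + up(relu(down(h))); only the adapter parameters would be
    trained, while the surrounding SSL layer weights stay frozen.
    """
    z = [max(0.0, a) for a in matvec(params["down"], h)]
    delta = matvec(params["up"], z)
    return [x + d for x, d in zip(h, delta)]
```

In a setup like this, the adapter output would replace the hidden features of an SSL layer before the speaker embedding is computed, so fine-tuning on noisy reference speech only has to learn a small correction on top of the pretrained representation.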