Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems, however, remain limited to resource-rich languages because large corpora of paired text and studio-quality audio are scarce. Most TTS systems are built from a single speaker's voice, yet there is growing interest in systems that can synthesize the voices of new speakers from only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework that uses quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper is the first to incorporate representations from both text-based and speech-based self-supervised learning models into multilingual speech synthesis tasks. We conducted comprehensive subjective and objective evaluations through a series of experiments. The model proved effective in terms of speech naturalness and speaker similarity for both seen and unseen speakers in six high-resource languages. We also tested the effectiveness of our method on two hypothetical low-resource languages. The results are promising: the proposed approach can synthesize audio that is intelligible and highly similar to the target speaker's voice, even without any training data for the new, unseen language.