This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) that aims to preserve the target language's pronunciation regardless of the original speaker's language. The model is based on a non-attentive Tacotron architecture in which the decoder is replaced with a normalizing flow network conditioned on speaker identity. Because linguistic content and speaker identity are inherently disentangled in this design, the same model can perform both TTS and voice conversion (VC). In a cross-lingual setting, acoustic features are first produced with a native speaker of the target language, and the same model then applies voice conversion to convert these features to the target speaker's voice. Objective and subjective evaluations indicate that our method can offer benefits over baseline cross-lingual synthesis. By including speakers with an average of 7.5 minutes of speech each, we also present positive results in low-resource scenarios.
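The two-stage procedure described above (synthesize with a native speaker, then convert to the target speaker with the same invertible decoder) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the architecture from the paper: a per-channel affine map stands in for the normalizing flow, and the names `SpeakerConditionedAffineFlow`, `voice_convert`, and `cross_lingual_tts` are hypothetical.

```python
import numpy as np


class SpeakerConditionedAffineFlow:
    """Toy stand-in for a speaker-conditioned normalizing flow (hypothetical).

    Each speaker owns a per-channel scale and shift, so the map
    x = z * exp(log_scale[s]) + shift[s] is exactly invertible.
    """

    def __init__(self, n_speakers, n_mels, seed=0):
        rng = np.random.default_rng(seed)
        self.log_scale = 0.1 * rng.standard_normal((n_speakers, n_mels))
        self.shift = rng.standard_normal((n_speakers, n_mels))

    def decode(self, z, speaker):
        # Latent (speaker-independent) -> acoustic features for `speaker`.
        return z * np.exp(self.log_scale[speaker]) + self.shift[speaker]

    def encode(self, x, speaker):
        # Exact inverse of decode: acoustic features -> latent.
        return (x - self.shift[speaker]) * np.exp(-self.log_scale[speaker])


def voice_convert(flow, mel, src_speaker, tgt_speaker):
    # Strip the source speaker by inverting the flow, then re-apply
    # the flow conditioned on the target speaker.
    z = flow.encode(mel, src_speaker)
    return flow.decode(z, tgt_speaker)


def cross_lingual_tts(flow, latent, native_speaker, tgt_speaker):
    # Stage 1: synthesize features with a native speaker of the target language.
    native_mel = flow.decode(latent, native_speaker)
    # Stage 2: convert those features to the target speaker's voice.
    return voice_convert(flow, native_mel, native_speaker, tgt_speaker)


if __name__ == "__main__":
    flow = SpeakerConditionedAffineFlow(n_speakers=3, n_mels=4)
    latent = np.random.default_rng(1).standard_normal((5, 4))  # 5 frames
    mel = cross_lingual_tts(flow, latent, native_speaker=0, tgt_speaker=2)
    print(mel.shape)
```

In this toy, the two-stage route coincides with decoding the latent directly with the target speaker, so the sketch only shows the mechanics of invertible, speaker-conditioned conversion. It does not reproduce the pronunciation benefit reported in the paper.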