While neural text-to-speech (TTS) has achieved human-like naturalness, multilingual TTS systems remain limited to resource-rich languages because they require paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS that uses only text data for the target language. Using text-only data allows TTS systems to be developed for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining on multilingual text-only data. We then train this model on paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages that are absent from the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS, with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets, and the implementation will be made available for reproducibility.
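For readers unfamiliar with the intelligibility metric, the following is a minimal sketch of how a character error rate (CER) is commonly computed: the character-level Levenshtein distance between a reference text and a transcript of the synthesized speech, divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative assumptions, and the abstract does not state which tool or text normalization was used for the reported figure.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (illustrative helper)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted character in an 11-character reference.
    print(f"{character_error_rate('hello world', 'hallo world'):.3f}")
```

Under this definition, a CER below 12% means that fewer than 12 character edits per 100 reference characters are needed to turn the transcript into the reference text.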