We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task. XPhoneBERT has the same architecture as BERT-base and is trained with the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that using XPhoneBERT as an input phoneme encoder significantly improves the naturalness and prosody of a strong neural TTS model, and also helps produce fairly high-quality speech when training data is limited. We publicly release our pre-trained XPhoneBERT in the hope that it will facilitate future research and downstream TTS applications for multiple languages. Our XPhoneBERT model is available at https://github.com/VinAIResearch/XPhoneBERT
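To make the pre-training objective concrete, the following is a minimal sketch of the masked-token corruption step that BERT-style and RoBERTa-style pre-training relies on, applied here to a phoneme sequence. It is an illustration under stated assumptions, not the released training code: the function name `mask_phonemes`, the `<mask>` symbol, the toy phoneme inventory, and the 15% selection rate with the 80/10/10 replacement split are the commonly used defaults for masked language modeling and are assumed here, since the abstract does not state the exact settings.

```python
import random

MASK = "<mask>"  # assumed mask symbol, for illustration only


def mask_phonemes(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a phoneme sequence for masked-language-model training.

    Each position is selected with probability `mask_prob`. A selected
    position is replaced by MASK 80% of the time, by a random phoneme
    10% of the time, and left unchanged 10% of the time. Returns the
    corrupted inputs and the labels (None where no prediction is needed).
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = tok  # the model must recover the original phoneme
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)
        # otherwise keep the original phoneme in place
    return inputs, labels


# Hypothetical toy inventory and sequence ("▁" marks a word boundary here).
toy_vocab = ["h", "ə", "l", "oʊ", "w", "ɜː", "d", "▁"]
toy_sequence = ["h", "ə", "l", "oʊ", "▁", "w", "ɜː", "l", "d"]

# Calling the function anew for every epoch yields a fresh mask each time,
# which is the "dynamic masking" idea associated with RoBERTa.
corrupted, targets = mask_phonemes(toy_sequence, toy_vocab, rng=random.Random(0))
```

Because the mask is sampled at call time rather than fixed once during preprocessing, the same phoneme sentence is seen with different corruptions across epochs.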