In this work, we take on the challenging task of building a single text-to-speech (TTS) synthesis system capable of generating speech in over 7,000 languages, many of which lack sufficient data for traditional TTS development. Through a novel integration of massively multilingual pretraining and meta-learning to approximate language representations, our approach enables zero-shot speech synthesis in languages for which no data is available. We validate the system's performance through objective measures and human evaluation across a diverse set of languages. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in speech technology.