Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

While neural text-to-speech (TTS) has achieved human-like, natural synthetic speech, multilingual TTS systems are limited to resource-rich languages because they need paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS that uses only text data for the target language. Using text-only data allows TTS systems to be developed for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining on multilingual text-only data. We then train this model on paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages that are absent from the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS, with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets, and the implementation will be made available for reproducibility.
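The intelligibility figure quoted in the abstract is a character error rate (CER). As a rough illustration of what that metric measures, the sketch below computes CER as the character-level Levenshtein distance between a reference text and a transcript, divided by the reference length. This is the standard definition only; the function names are hypothetical, and the abstract does not say which speech recognizer or text normalization the authors used to obtain transcripts, so none is assumed here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `character_error_rate("hello world", "hallo world")` counts one substitution over eleven reference characters. Under this definition, a CER below 12% corresponds to fewer than 12 character edits per 100 reference characters.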


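Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, with considerable economic benefit. In addition, with the spread of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.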
