This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS systems typically support tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso achieve significantly better naturalness and intelligibility than baseline models in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.