Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting transcribed data at that scale is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model that leverages large amounts of untranscribed speech. With our pre-training, we can substantially reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn a proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves data efficiency during fine-tuning. We empirically demonstrate the effectiveness of the proposed method in low-resource language scenarios, where it outperforms competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping
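
To make the pre-training task concrete, the following is a minimal sketch of how a (warped input, original target) pair could be built from untranscribed speech. The abstract does not specify the warping function, so the random piecewise-linear time warp used here, the function name `random_time_warp`, and the `num_segments` parameter are all illustrative assumptions rather than the method implemented in the linked repository.

```python
import numpy as np


def random_time_warp(mel, num_segments=4, rng=None):
    """Warp a mel-spectrogram of shape (n_mels, T) along the time axis.

    Hypothetical helper: uses a random monotonic piecewise-linear map,
    which is an assumption and not necessarily the paper's warping.
    The output has the same shape as the input.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_mels, T = mel.shape

    # Random, sorted breakpoints on the source time axis; the endpoints are
    # pinned so the first and last frames stay in place.
    interior = np.sort(rng.uniform(0, T - 1, num_segments - 1))
    src_knots = np.concatenate(([0.0], interior, [T - 1.0]))
    # Evenly spaced breakpoints on the output time axis.
    dst_knots = np.linspace(0, T - 1, num_segments + 1)

    # For every output frame, the (fractional) source frame it is read from.
    src_pos = np.interp(np.arange(T), dst_knots, src_knots)

    # Linear interpolation between the two neighbouring source frames.
    lo = np.floor(src_pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = src_pos - lo
    return mel[:, lo] * (1.0 - frac) + mel[:, hi] * frac


# Build one pre-training pair from a stand-in spectrogram (random values here;
# in practice this would be computed from untranscribed audio).
rng = np.random.default_rng(0)
target_mel = rng.standard_normal((80, 120))   # de-warped target
warped_mel = random_time_warp(target_mel, rng=rng)  # model input
```

Under these assumptions, the sequence-to-sequence model would take `warped_mel` as input and be trained to output `target_mel`, so that no transcripts are needed during pre-training.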