Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained with 23.5 hours of LJSpeech data. In addition, we tested our model with a semantically unpredictable sentences test, which showed that both models exhibit similar intelligibility levels.
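To illustrate the idea of augmenting training data with simple stationary noise, the following is a minimal sketch and not the authors' implementation. The abstract does not state the noise type, the noise level, or whether noise is added to the waveform or to spectral features. This sketch assumes white Gaussian noise mixed into the waveform at a chosen signal-to-noise ratio. The function names `add_stationary_noise` and `measured_snr_db` and the parameter `snr_db` are hypothetical.

```python
import numpy as np


def add_stationary_noise(wave, snr_db, rng=None):
    """Mix white Gaussian noise into `wave` at the requested SNR in dB.

    Assumption: white Gaussian noise stands in for "simple stationary
    noise"; the abstract does not specify the noise actually used.
    """
    rng = np.random.default_rng() if rng is None else rng
    wave = np.asarray(wave, dtype=np.float64)
    signal_power = np.mean(wave ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, 1.0, size=wave.shape)
    # Rescale so the realised noise power matches the target exactly.
    noise *= np.sqrt(noise_power / np.mean(noise ** 2))
    return wave + noise


def measured_snr_db(clean, noisy):
    """Return the SNR in dB of `noisy` relative to the `clean` reference."""
    residual = np.asarray(noisy) - np.asarray(clean)
    return 10.0 * np.log10(np.mean(np.square(clean)) / np.mean(residual ** 2))


if __name__ == "__main__":
    # One second of a 220 Hz tone at 22.05 kHz as a stand-in for speech.
    t = np.arange(22050) / 22050.0
    clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
    noisy = add_stationary_noise(clean, snr_db=20.0, rng=np.random.default_rng(0))
    print(f"measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
```

In a training pipeline, a function like this would typically be applied on the fly to each utterance before feature extraction, so the augmented copies never need to be stored on disk. That placement is also an assumption rather than something the abstract specifies.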