To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variance. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). The prosody embeddings have complex distributions due to the dynamic nature of speech, and GANs let us estimate them both reliably and quickly. We also show that the prosody embeddings serve as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and lower computational complexity.