A text-to-speech (TTS) model trained to reconstruct speech from text tends toward predictions close to the average characteristics of its dataset, failing to model the variations that make human speech sound natural. This problem is magnified in zero-shot voice cloning, a task that requires training data with high variance in speaking styles. Building on recent work that uses generative adversarial networks (GANs), we propose a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model trained on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model improves over the baseline in both speech quality and speaker similarity. Audio examples from our system are available online.
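As a rough illustration of the conditional discriminator described above, the sketch below shows a Transformer encoder-decoder discriminator in PyTorch: the encoder consumes a text-side embedding sequence as the condition, the decoder cross-attends to it while processing speech features, and a linear head emits per-frame real/fake logits. All names, dimensions, and the choice of mel-spectrogram inputs are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class ConditionalTransformerDiscriminator(nn.Module):
    """Scores speech features as real or generated, conditioned on text.

    Hypothetical sketch: mel-spectrogram frames are assumed as the speech
    features and a phoneme/text embedding sequence as the condition.
    """

    def __init__(self, mel_dim=80, text_dim=256, d_model=256, nhead=4, layers=4):
        super().__init__()
        self.mel_proj = nn.Linear(mel_dim, d_model)    # speech features -> model dim
        self.text_proj = nn.Linear(text_dim, d_model)  # condition -> model dim
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.score = nn.Linear(d_model, 1)  # per-frame real/fake logit

    def forward(self, mel, text_emb):
        # mel: (batch, frames, mel_dim); text_emb: (batch, tokens, text_dim)
        cond = self.text_proj(text_emb)  # encoder input (conditioning sequence)
        feats = self.mel_proj(mel)       # decoder input (features under judgment)
        hidden = self.transformer(src=cond, tgt=feats)
        return self.score(hidden)        # (batch, frames, 1)

# Example: judge 200 mel frames conditioned on 50 text-token embeddings.
disc = ConditionalTransformerDiscriminator()
logits = disc(torch.randn(2, 200, 80), torch.randn(2, 50, 256))  # (2, 200, 1)

The per-frame logits could then be pooled and plugged into a standard GAN objective (e.g., least-squares or hinge loss) alongside the usual TTS reconstruction losses; the paper's exact objective and conditioning scheme are not restated here.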