Although imperfect score matching causes a drift between the training and sampling distributions of diffusion models, recent diffusion-based acoustic models such as Grad-TTS have revolutionized single-speaker Text-to-Speech (TTS) when sufficient data is available. In practice, however, this sampling drift causes these approaches to struggle in multi-speaker scenarios, where the target data distribution is more complex than in the single-speaker case. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model that introduces the Consistent Diffusion Model (CDM) as its generative modeling approach. We enforce the consistency property of CDM during training to alleviate sampling drift at inference, resulting in significant improvements in multi-speaker TTS performance. Our experimental results show that the proposed approach improves performance over Grad-TTS for the different speakers involved in multi-speaker TTS, and even outperforms the fine-tuning approach. Audio samples are available at https://welkinyang.github.io/multi-gradspeech/