Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

Diffusion models can generate high-quality data through a probabilistic approach, but generation is slow because sampling requires a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models such as denoising diffusion generative adversarial networks (denoising diffusion GANs) combine the diffusion process with GANs. In speech synthesis, DiffGAN-TTS, a recent diffusion-based model that adopts a GAN structure, has demonstrated superior performance in both speech quality and generation speed. In this paper, to further improve on DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator that learns the distribution of the reverse process and a spectrogram discriminator that learns the distribution of the generated data. We evaluate the proposed model with objective metrics, namely structural similarity index measure (SSIM), mel-cepstral distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ), and with a subjective metric, mean opinion score (MOS). The results show that the proposed model outperforms recent state-of-the-art models such as FastSpeech2 and DiffGAN-TTS on various metrics. Our implementation and audio samples are available on GitHub.
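Two of the objective metrics named above, MCD and F0 RMSE, have compact closed forms, and a minimal sketch may make them concrete. The abstract does not say how the authors compute them, so everything below is an assumption: the function names are hypothetical, frames are taken to be already time-aligned (for example by dynamic time warping), the 0th cepstral coefficient is taken to be already removed, and F0 RMSE is computed only over frames that are voiced in both contours. This is one commonly used formulation, not necessarily the one used in the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames.

    Assumption: each frame is a sequence of mel-cepstral coefficients
    with the 0th (energy) coefficient already dropped, and the two
    sequences have already been aligned frame by frame.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both contours.

    Assumption: unvoiced frames are marked with F0 = 0 and are skipped.
    """
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames voiced in both contours")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))
```

For both metrics, lower values indicate that the synthesized speech is closer to the reference. For instance, `f0_rmse([100.0, 0.0, 200.0], [110.0, 150.0, 190.0])` skips the middle frame (unvoiced in the reference) and returns 10.0.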

