Diffusion models generate high-quality data through a probabilistic approach, but generation is slow because sampling requires a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models such as denoising diffusion generative adversarial networks (GANs) combine the diffusion process with GANs. In speech synthesis, DiffGAN-TTS, a recently introduced diffusion-based model that adopts a GAN structure, demonstrates superior performance in both speech quality and generation speed. In this paper, to further improve on DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator that learns the distribution of the reverse process and a spectrogram discriminator that learns the distribution of the generated data. We evaluate the proposed model with objective metrics, namely structural similarity index measure (SSIM), mel-cepstral distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ), as well as subjective metrics such as mean opinion score (MOS). The results show that the proposed model outperforms recent state-of-the-art models such as FastSpeech2 and DiffGAN-TTS on various metrics. Our implementation and audio samples are available on GitHub.
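As a rough illustration of two of the objective metrics named in the abstract, the sketch below computes MCD and F0 RMSE from their standard textbook definitions. It is not the authors' evaluation code. The function names `mel_cepstral_distortion` and `f0_rmse` are hypothetical. The sketch assumes the reference and synthesized sequences are already time-aligned frame by frame, that the 0th (energy) cepstral coefficient is excluded, and that unvoiced frames are marked with F0 = 0. The abstract specifies none of these choices.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over frames.

    ref, syn: equal-length lists of frames, each frame a list of
    mel-cepstral coefficients. Assumption: frames are already aligned
    and coefficient 0 (energy) is skipped.
    """
    if len(ref) != len(syn) or not ref:
        raise ValueError("sequences must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        # Squared Euclidean distance over coefficients 1..D.
        sq = sum((a - b) ** 2 for a, b in zip(r[1:], s[1:]))
        total += scale * math.sqrt(2.0 * sq)
    return total / len(ref)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced in both contours.

    Assumption: an F0 value of 0 marks an unvoiced frame.
    """
    pairs = [(a, b) for a, b in zip(ref_f0, syn_f0) if a > 0 and b > 0]
    if not pairs:
        return float("nan")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

Lower values are better for both metrics. In practice the two sequences usually differ in length, so an alignment step such as dynamic time warping would come first. It is omitted here to keep the sketch short.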