Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by means of a forward diffusion process, which consists of injecting noise from a Gaussian distribution into both real and fake samples before inputting them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably in audio quality and efficiency with several baselines.
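To make the noise-injection idea concrete, the following is a minimal sketch, not the SpecDiff-GAN implementation. It assumes a standard variance-preserving forward diffusion with a linear beta schedule, and it shapes the noise by multiplying white Gaussian noise by a magnitude envelope in the frequency domain. The function names (`alpha_bar`, `shaped_noise`, `diffuse`), the schedule values, and the envelope are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def alpha_bar(num_steps, beta_min=1e-4, beta_max=2e-2):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)

def shaped_noise(length, envelope, rng):
    """White Gaussian noise filtered by a magnitude envelope (hypothetical shaping).

    `envelope` must have length // 2 + 1 entries, one per rFFT bin.
    The result is rescaled to unit standard deviation.
    """
    white = rng.standard_normal(length)
    spectrum = np.fft.rfft(white)
    shaped = np.fft.irfft(spectrum * envelope, n=length)
    return shaped / (shaped.std() + 1e-12)

def diffuse(x0, t, a_bar, rng, envelope=None):
    """Apply t forward diffusion steps in closed form to a 1-D waveform x0."""
    n = x0.shape[-1]
    if envelope is None:
        eps = rng.standard_normal(n)          # plain Gaussian noise
    else:
        eps = shaped_noise(n, envelope, rng)  # spectrally shaped noise
    return np.sqrt(a_bar[t]) * x0 + np.sqrt(1.0 - a_bar[t]) * eps

# Usage: noise both the real and the generated waveform at the same step
# before they would be passed to a discriminator (not shown here).
rng = np.random.default_rng(0)
a_bar = alpha_bar(num_steps=50)
real = np.sin(np.linspace(0.0, 100.0, 1024))   # stand-in for a real waveform
fake = np.zeros(1024)                          # stand-in for a generator output
env = np.linspace(1.0, 0.1, 1024 // 2 + 1)     # hypothetical low-pass-like envelope
t = 10
real_noisy = diffuse(real, t, a_bar, rng, envelope=env)
fake_noisy = diffuse(fake, t, a_bar, rng, envelope=env)
```

In this sketch the same step `t` is used for both inputs, so the discriminator would compare samples at a matched noise level; how the step is chosen during training is left open here.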