Generative adversarial network (GAN)-based neural vocoders are widely used in audio synthesis because of their high generation quality, efficient inference, and small computational footprint. However, it remains challenging to train a universal vocoder that generalizes well to out-of-domain (OOD) scenarios, such as unseen speaking styles, non-speech vocalizations, singing, and musical pieces. In this work, we propose SnakeGAN, a GAN-based universal vocoder that synthesizes high-fidelity audio in a variety of OOD scenarios. SnakeGAN takes a coarse-grained signal generated by a differentiable digital signal processing (DDSP) model as prior knowledge and aims to recover a high-fidelity waveform from a Mel-spectrogram. We introduce periodic nonlinearities into the generator through the Snake activation function and an anti-aliased representation, which provides the desired inductive bias for audio synthesis and significantly improves the extrapolation capacity for universal vocoding in unseen scenarios. To validate the effectiveness of the proposed method, we train SnakeGAN on speech data only and evaluate its performance on various OOD distributions with both subjective and objective metrics. Experimental results show that SnakeGAN significantly outperforms the compared approaches and generates high-fidelity audio for unseen speakers with unseen styles, singing voices, instrumental pieces, and nonverbal vocalizations.
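The abstract names the Snake activation without stating its form. As it is commonly defined in the literature, the function is snake_a(x) = x + (1/a) * sin^2(a * x), where a > 0 sets the frequency of the periodic term. The following is a minimal sketch under that assumption, not SnakeGAN's implementation: the names `snake` and `snake_layer` are illustrative, `alpha` is treated as a fixed scalar (in neural vocoders it is typically a learnable parameter), and the anti-aliasing step mentioned in the abstract is not modeled here.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation under the assumed form: x + sin^2(alpha * x) / alpha.

    The identity term keeps the function monotone (its derivative,
    1 + sin(2 * alpha * x), is never negative), while the sin^2 term
    adds a periodic component with frequency controlled by alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_layer(values, alpha: float = 1.0):
    """Apply the activation elementwise to a sequence of floats."""
    return [snake(v, alpha) for v in values]


if __name__ == "__main__":
    # Larger alpha gives faster, smaller-amplitude oscillation around y = x.
    for a in (0.5, 1.0, 4.0):
        print(a, [round(y, 3) for y in snake_layer([-1.0, 0.0, 1.0, 2.0], a)])
```

Because the periodic term is bounded by 1/alpha, the output never drifts far from the identity, which is one way to read the "inductive bias" for periodic signals that the abstract refers to.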