Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
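The core idea can be sketched informally: SAN treats the discriminator's last linear layer as a normalized slicing direction and optimizes that direction with an inner-product (Wasserstein-like) objective while the preceding features are trained with the original GAN loss, here a least-squares one as in most GAN-based vocoders. The snippet below is a minimal, forward-only illustration of this split; all function and variable names are ours for exposition, and it is not the paper's implementation (in particular, the paper's actual scheme for making the least-squares loss satisfy SAN's requirements differs in detail).

```python
import numpy as np

def san_discriminator_losses(feat_real, feat_fake, w):
    """Illustrative SAN-style split of a discriminator objective.

    The discriminator output is h(x) = <w/||w||, f(x)>, where f(x) are
    last-layer features and w is the slicing direction. SAN optimizes:
      - the direction w with an inner-product loss (features held fixed),
      - the features with the original GAN loss (direction held fixed),
    here a least-squares loss with targets 1 (real) and 0 (fake).
    This sketch only computes the two loss values; in training, each
    term would back-propagate to only one of the two parameter groups.
    """
    w_hat = w / np.linalg.norm(w)              # normalized slicing direction
    h_real = feat_real @ w_hat                 # projections of real features
    h_fake = feat_fake @ w_hat                 # projections of fake features
    # Direction term: maximize E[h(real)] - E[h(fake)] along w_hat.
    loss_dir = -(h_real.mean() - h_fake.mean())
    # Feature term: least-squares GAN loss on the same projections.
    loss_feat = ((h_real - 1.0) ** 2).mean() + (h_fake ** 2).mean()
    return loss_dir, loss_feat

# Toy features: real features cluster around +1, fake around -1.
rng = np.random.default_rng(0)
f_real = rng.normal(1.0, 0.1, size=(8, 4))
f_fake = rng.normal(-1.0, 0.1, size=(8, 4))
loss_dir, loss_feat = san_discriminator_losses(f_real, f_fake, np.ones(4))
```

With well-separated toy features, the direction term is strongly negative (the current direction already separates real from fake), while the least-squares term stays positive; in an actual vocoder the two terms would update the slicing direction and the feature extractor, respectively.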