Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real time. However, most GANs have been reported to fail to obtain the optimal projection for discriminating between real and fake data in the feature space. The slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, has been shown to be effective in image generation. In this paper, we investigate the effectiveness of SAN in the vocoding task. To this end, we propose a scheme for modifying the least-squares GAN, which most GAN-based vocoders adopt, so that its loss functions satisfy the requirements of SAN. Our experiments demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with only small modifications. Our code is available at https://github.com/sony/bigvsan.
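For reference, the standard least-squares GAN objectives that most GAN-based vocoders adopt can be sketched as follows. This is only a minimal illustration of the baseline LSGAN losses, assuming the discriminator outputs are available as plain lists of scalar scores; it does not show the SAN modification proposed in this paper, and the function names are hypothetical rather than taken from the released code.

```python
# Baseline least-squares GAN (LSGAN) objectives on lists of discriminator
# scores. Illustrative only: this is NOT the SAN variant, and the function
# names are hypothetical.

def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push scores on real audio toward 1 and scores on generated audio toward 0."""
    real_term = mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = mean([s ** 2 for s in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """Push discriminator scores on generated audio toward 1."""
    return mean([(s - 1.0) ** 2 for s in fake_scores])


if __name__ == "__main__":
    real = [0.9, 1.1]   # made-up scores for real waveforms
    fake = [0.2, -0.1]  # made-up scores for generated waveforms
    print(lsgan_discriminator_loss(real, fake))
    print(lsgan_generator_loss(fake))
```

Both losses are zero exactly when the discriminator assigns 1 to every real sample and the respective target value to every generated sample, which is the property the squared-error form is designed to encode.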