A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because it is fast, lightweight, and produces high-quality speech. However, this data-driven model requires a large amount of training data, which incurs high data-collection costs. This motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to the distributional changes caused by data augmentation. Thus, augmented speech, which can be atypical, may be treated as real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, and thereby assesses the input speech according to the augmentation state without inhibiting the learning of the original non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited data conditions while achieving comparable speech quality under sufficient data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.
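The conditioning idea can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the augmentation state is encoded, how the discriminator is built, or which loss is used, so everything here is an assumption made for illustration: the state is a single speed-change rate (with 1.0 meaning "not augmented"), the discriminator is a toy one-hidden-layer network over crude frame energies, the loss is a least-squares GAN loss, and the names `speed_augment`, `frame_energies`, `AugCondDiscriminator`, and `d_loss_least_squares` are hypothetical.

```python
import math
import random

def speed_augment(wave, rate):
    """Hypothetical augmentation: nearest-neighbor resampling that changes speed by `rate`."""
    n_out = max(1, int(len(wave) / rate))
    return [wave[min(len(wave) - 1, int(i * rate))] for i in range(n_out)]

def frame_energies(wave, n_frames=8):
    """Crude fixed-size feature vector: mean energy of n_frames equal-length chunks."""
    size = max(1, len(wave) // n_frames)
    feats = []
    for k in range(n_frames):
        chunk = wave[k * size:(k + 1) * size] or [0.0]
        feats.append(sum(s * s for s in chunk) / len(chunk))
    return feats

class AugCondDiscriminator:
    """Toy discriminator D(x, c) that sees speech features and the augmentation state."""

    def __init__(self, n_feat=8, n_state=1, n_hidden=4, seed=0):
        rng = random.Random(seed)
        n_in = n_feat + n_state
        self.w1 = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.w2 = [rng.gauss(0.0, 0.5) for _ in range(n_hidden)]

    def score(self, wave, aug_state):
        # Conditioning by concatenation (an assumption): the same waveform can
        # score differently depending on the augmentation state it is paired with.
        x = frame_energies(wave) + list(aug_state)
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return sum(w * h for w, h in zip(self.w2, hidden))

def d_loss_least_squares(disc, real_wave, fake_wave, aug_state):
    """Least-squares discriminator loss with both inputs judged under the same state."""
    real_term = (disc.score(real_wave, aug_state) - 1.0) ** 2
    fake_term = disc.score(fake_wave, aug_state) ** 2
    return real_term + fake_term

# Usage: augment real and generated speech identically, then pass the state along.
rng = random.Random(1)
real = [math.sin(0.05 * t) for t in range(400)]
fake = [rng.uniform(-1.0, 1.0) for _ in range(400)]  # stand-in for generator output
rate = 1.25
state = [rate]
disc = AugCondDiscriminator()
loss = d_loss_least_squares(disc, speed_augment(real, rate), speed_augment(fake, rate), state)
print(loss)
```

Because the state is part of the discriminator input, speech augmented at rate 1.25 is compared against other speech augmented at that rate rather than against non-augmented speech. An unconditional discriminator, by contrast, would have to place all augmented variants into a single "real" class.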