Voice conversion (VC) transforms speaking style while preserving linguistic information. Many researchers use deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. Samples generated by Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity, but DDPMs have high computational costs and slower inference than GANs. To make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve non-parallel many-to-many voice conversion. We use large steps to achieve denoising, and also introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network. Both objective and subjective evaluation experiments show that DiffGAN-VC achieves high voice quality on non-parallel datasets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves higher speaker similarity, naturalness, and sound quality.