Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still suffer from inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which effectively generates the fundamental frequency (F0) with the target voice style. The generated F0 is then fed to DiffVoice, which converts the speech to the target voice style. Furthermore, using a source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using a masked prior in the diffusion models, our model improves the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance; in zero-shot VC scenarios, our model achieves a character error rate (CER) of 0.83% and an equal error rate (EER) of 3.29%.
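For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a character error rate and an equal error rate are commonly computed. It is not the evaluation code used for Diff-HierVC: the function names `character_error_rate` and `equal_error_rate` are illustrative, the abstract does not say which speech recognizer or speaker verification model produced the transcripts and scores, and the EER here is a simple threshold sweep rather than an interpolated estimate.

```python
import numpy as np


def character_error_rate(reference, hypothesis):
    """Levenshtein distance between two strings, divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER from similarity scores (assumption: higher means same speaker)."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: impostor trials scoring at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(character_error_rate("kitten", "sitting"))
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

In practice the CER would be averaged over transcripts of many converted utterances, and the EER would be computed over a large set of verification trials; the toy inputs in the `__main__` block only exercise the functions.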