Diffusion-based singing voice conversion (SVC) models achieve better synthesis quality than traditional methods. However, in cross-domain SVC scenarios, where the pitch ranges of the source and target voice domains differ significantly, these models tend to generate hoarse audio, which makes high-quality vocal output difficult to achieve. In this paper, we therefore propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which enhances voice quality in SVC tasks without requiring additional data or increasing the number of model parameters. We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model. Experimental results on the public singing dataset M4Singer indicate that the proposed method significantly improves model performance in general SVC scenarios and particularly in cross-domain SVC scenarios.
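To make the SSIM loss concrete, the following is a minimal sketch of how a structural-similarity term could be computed between a predicted and a target spectrogram. The abstract does not state how SSIM is applied, so several things here are assumptions: that the loss compares two equally shaped 2-D arrays (for example mel-spectrograms scaled to [0, 1]), that the loss is defined as 1 - SSIM, and that a single global window is used instead of the sliding window common in image work. The names `global_ssim` and `ssim_loss` are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed over one global window (a simplification of windowed SSIM).

    x, y: equally shaped 2-D inputs given as lists of rows of floats.
    data_range: assumed dynamic range of the values (1.0 for inputs in [0, 1]).
    """
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("inputs must be non-empty and have the same size")

    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Population variances and covariance of the two inputs.
    var_x = sum((v - mu_x) * (v - mu_x) for v in xs) / n
    var_y = sum((v - mu_y) * (v - mu_y) for v in ys) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n

    # Standard SSIM stabilising constants.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(pred, target, data_range=1.0):
    """Assumed loss form: 1 - SSIM, so identical inputs give a loss near 0."""
    return 1.0 - global_ssim(pred, target, data_range)


if __name__ == "__main__":
    target = [[0.0, 1.0], [1.0, 0.0]]   # toy 2x2 "spectrogram"
    inverted = [[1.0, 0.0], [0.0, 1.0]]  # structurally opposite pattern
    print(ssim_loss(target, target))
    print(ssim_loss(inverted, target))
```

In a training setup this term would typically be computed on tensors in an automatic-differentiation framework and combined with the model's other losses; the weighting and the exact windowing are not given in the abstract.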