Singing voice synthesis (SVS), which generates singing voices from musical scores, is constrained by limited data availability, a problem less common in text-to-speech (TTS). This study proposes a new approach to address this data scarcity. We use an existing singing voice synthesizer for data augmentation and apply precise manual tuning to reduce unnaturalness in the synthesized voices. We develop two extensive singing voice corpora, ACE-Opencpop and KiSing-v2, which facilitate large-scale, multi-singer voice synthesis. Using models pre-trained on these corpora, we achieve notable improvements in voice quality in both in-domain and out-of-domain scenarios. The corpora, pre-trained models, and related training recipes are publicly available at Muskits-ESPnet (https://github.com/espnet/espnet).