In singing voice synthesis (SVS), generating singing voices from musical scores is challenging because training data are scarce. This study proposes a strategy to address data scarcity in SVS. We use an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural synthesized voice. This method yields two large singing voice datasets, ACE-Opencpop and ACE-KiSing, which support large-scale, multi-singer voice synthesis. Through experiments, we show that these datasets not only serve as new benchmarks for SVS but also improve SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (\url{https://github.com/espnet/espnet}).