Singing voice synthesis (SVS) has advanced remarkably in recent years. However, compared with speech and general audio data, publicly available singing datasets remain limited. In practice, this data scarcity often degrades performance in long-tail scenarios, such as imbalanced pitch distributions or rare singing styles. To mitigate these challenges, we propose uncertainty-based optimization to improve the training of end-to-end SVS models. First, we introduce differentiable data augmentation into adversarial training; it operates sample-wise to increase the prior uncertainty. Second, we incorporate a frame-level uncertainty prediction module that estimates the posterior uncertainty, enabling the model to allocate more learning capacity to low-confidence segments. Empirical results on the Opencpop and Ofuton-P datasets, covering Chinese and Japanese, demonstrate that our approach improves performance across multiple evaluation aspects.
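The two components can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the function names (`sample_wise_gain`, `frame_weights`, `uncertainty_weighted_l1`), the choice of random gain as the augmentation, the gain range in decibels, and the softmax-based weighting are all hypothetical and are not taken from the paper. The sketch uses plain Python lists in place of tensors.

```python
import math
import random


def sample_wise_gain(batch, rng, low_db=-6.0, high_db=6.0):
    """Scale each waveform in the batch by its own random gain (assumed augmentation).

    Multiplying by a constant is differentiable with respect to the
    waveform, so gradients could pass through it in an autodiff framework.
    """
    augmented = []
    for waveform in batch:
        gain_db = rng.uniform(low_db, high_db)  # one draw per sample
        gain = 10.0 ** (gain_db / 20.0)  # decibels to linear amplitude
        augmented.append([gain * x for x in waveform])
    return augmented


def frame_weights(log_variances):
    """Turn per-frame log-variances into weights that average to 1 (assumed scheme).

    Softmax over frames, rescaled by the number of frames, so frames
    with higher predicted uncertainty receive larger weights.
    """
    peak = max(log_variances)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in log_variances]
    total = sum(exps)
    n = len(log_variances)
    return [n * e / total for e in exps]


def uncertainty_weighted_l1(pred, target, log_variances):
    """Per-frame absolute error, reweighted by predicted uncertainty, then averaged."""
    weights = frame_weights(log_variances)
    errors = [abs(p - t) for p, t in zip(pred, target)]
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_wise_gain([[0.1, -0.2], [0.3, 0.4]], rng))
    print(frame_weights([0.0, 0.0, 2.0]))
    print(uncertainty_weighted_l1([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 2.0]))
```

In a real training loop the weights would probably be treated as constants (no gradient flowing through them), so that the model cannot lower the loss simply by shifting its uncertainty estimates; that detail is also an assumption of this sketch.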