Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) aims to generate high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, modeling the intricate nuances of singing voice styles is difficult because singing voices are highly expressive. Moreover, existing SVS methods suffer degraded synthesis quality in OOD scenarios because they assume that the target vocal attributes are discernible during training. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two key components: 1) the Residual Style Adaptor (RSA), which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) Uncertainty Modeling Layer Normalization (UMLN), which perturbs the style attributes within the content representation during training and thereby improves model generalization. Our evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Singing voice samples are available at https://stylesinger.github.io/.
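As a rough illustration of the two ideas named in the abstract, the sketch below shows generic residual quantization and a layer normalization whose style-derived scale and bias are perturbed with Gaussian noise during training. It is not StyleSinger's implementation: the function names, the NumPy formulation, the random codebooks, and the Gaussian perturbation with `noise_std` are all assumptions made for illustration only.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize vectors x (N, D) with a list of codebooks, each (K, D).

    Each stage quantizes the residual left over by the previous stages.
    Returns the summed quantized vectors and the code indices per stage.
    """
    residual = np.array(x, dtype=float)  # working copy; x is not modified
    quantized = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from each residual to each code vector.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        quantized += chosen
        residual -= chosen
        indices.append(idx)
    return quantized, np.stack(indices, axis=1)


def perturbed_style_layer_norm(h, scale, bias, rng=None, noise_std=0.1, eps=1e-5):
    """Layer-normalize h (N, D), then apply style-derived scale and bias (D,).

    Assumption: when rng is given (training), Gaussian noise perturbs the
    scale and bias; when rng is None (inference), they are used unchanged.
    """
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    normed = (h - mean) / np.sqrt(var + eps)
    if rng is not None:
        scale = scale + noise_std * rng.standard_normal(scale.shape)
        bias = bias + noise_std * rng.standard_normal(bias.shape)
    return normed * scale + bias


# Usage with random stand-ins for learned codebooks and style vectors.
rng = np.random.default_rng(0)
style = rng.standard_normal((4, 8))
codebooks = [rng.standard_normal((16, 8)) for _ in range(3)]
quantized, codes = residual_quantize(style, codebooks)
print(codes.shape)  # (4, 3): one code index per vector per stage

content = rng.standard_normal((4, 8))
out = perturbed_style_layer_norm(content, np.ones(8), np.zeros(8), rng=rng)
print(out.shape)  # (4, 8)
```

In a trained system the codebooks would be learned and the scale and bias would be predicted from a style embedding; here they are fixed arrays so that the example stays self-contained.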