We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in recent years, converting a singer's singing style remains an unexplored research area. We identify three main challenges in SSC: modeling the target style, disentangling the source style, and retaining the source melody. To model the target singing style, we use an audio infilling task: a flow-matching model predicts a masked segment of the target mel-spectrogram, conditioned on the unmasked remainder of that mel-spectrogram together with disentangled acoustic features. To disentangle the source singing style, we use a cyclic training approach in which synthetic converted samples serve as source inputs and the original source mel-spectrogram is the reconstruction target. Finally, to better retain the source melody, we investigate a post-processing module that uses a source-filter-based vocoder to resynthesize the converted waveforms with the original F0 patterns. Our results show that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. Moreover, although resynthesizing with the original F0 patterns alleviates out-of-tune singing and improves naturalness, we find a slight tradeoff in similarity because the F0 patterns are not changed into the target style.
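The audio infilling objective described above can be illustrated with a minimal sketch. It assumes a linear (rectified-flow style) probability path between Gaussian noise and the target mel-spectrogram, which the abstract does not specify; the function name `infilling_flow_matching_loss`, the `velocity_fn` callable, and all array shapes are hypothetical and chosen only for illustration, not taken from the Serenade implementation.

```python
import numpy as np

def infilling_flow_matching_loss(mel, mask, cond, velocity_fn, rng):
    """Flow-matching loss for masked mel-spectrogram infilling (illustrative sketch).

    mel:  (T, n_mels) target mel-spectrogram.
    mask: (T,) boolean array, True for frames the model must predict.
    cond: (T, d) disentangled acoustic features (hypothetical shape).
    velocity_fn(x_t, t, context, cond) -> (T, n_mels) predicted velocity.
    """
    if not mask.any():
        raise ValueError("mask must hide at least one frame")

    # Sample a time step and a noise sample for the flow.
    t = rng.uniform()
    noise = rng.standard_normal(mel.shape)

    # Assumed linear path from noise (t=0) to data (t=1) and its velocity.
    x_t = (1.0 - t) * noise + t * mel
    target_velocity = mel - noise

    # Context is the complement of the masked segment: masked frames are zeroed.
    context = np.where(mask[:, None], 0.0, mel)

    pred = velocity_fn(x_t, t, context, cond)

    # Only the masked segment contributes to the loss.
    return float(((pred - target_velocity) ** 2)[mask].mean())
```

In a real system `velocity_fn` would be a trained neural network; here it is left as a plain callable so the loss itself can be inspected in isolation.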