Regenerating a singing voice with altered lyrics while preserving the original melody remains challenging: existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model for melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs (an optional timbre reference, a singing clip that provides the melody, and the modified lyrics) and requires no manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline that supports melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for evaluating melody-preserving lyric modification. The code, weights, benchmark, and demos are publicly available at https://github.com/ASLP-lab/YingMusic-Singer.
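For readers unfamiliar with Group Relative Policy Optimization, the core idea is that several outputs are sampled for the same input, each is scored, and every sample's advantage is its reward normalized against the group's mean and standard deviation. The sketch below illustrates only this generic normalization step. The reward values are hypothetical placeholders (the abstract does not specify the reward design), and the function name `group_relative_advantages` is an assumption introduced for illustration, not part of the released code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (generic GRPO-style step).

    rewards: scores for several candidates generated from the same input.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate renditions of the same edited lyric,
# e.g. some combination of melody-preservation and lyric-adherence scores.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Candidates that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged; no separate value network is needed for this step.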