A song is a combination of singing voice and accompaniment. However, existing works treat singing voice synthesis and music generation as independent tasks, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. To alleviate data scarcity, we build a Chinese song dataset mined from a music website. Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.