Lyrics-to-melody generation is an interesting and challenging topic in AI music research. Because the correlations between lyrics and melody are difficult to learn, previous methods suffer from low generation quality and a lack of controllability. Controllability allows humans to interact with a generative model to obtain the desired content, which is especially important in music generation, where human-centered AI can assist musicians in their creative work. To address these issues, we propose ConL2M, a controllable lyrics-to-melody generation network that generates realistic melodies from lyrics in a user-specified musical style. Our work makes three main contributions: 1) to model the dependencies of music attributes across multiple sequences, we propose inter-branch memory fusion (Memofu), which enables information flow between the branches of a multi-branch stacked LSTM architecture; 2) we propose reference style embedding (RSE) to improve generation quality and to control the musical style of the generated melodies; 3) we propose a sequence-level statistical loss (SeqLoss) to help the model learn sequence-level features of melodies given lyrics. Evaluated with metrics for music quality and controllability, this initial study of controllable lyrics-to-melody generation shows improved generation quality and demonstrates the feasibility of interacting with users to generate melodies in the desired musical style from given lyrics.
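The abstract does not specify how SeqLoss is computed, so the following is only a rough illustration of what a sequence-level statistical loss could look like, not the formulation used in ConL2M. It assumes melodies are lists of MIDI pitch numbers and compares two summary statistics (mean and standard deviation) of a generated sequence against a reference. The function names `sequence_stats` and `seq_level_loss` and the choice of statistics are hypothetical.

```python
from statistics import mean, pstdev


def sequence_stats(pitches):
    """Return (mean, population std) of a pitch sequence of MIDI note numbers."""
    return mean(pitches), pstdev(pitches)


def seq_level_loss(generated, reference):
    """Hypothetical sequence-level loss (an assumption, not the paper's SeqLoss).

    Sums the squared gaps between the summary statistics of the two
    sequences, so it penalizes whole-sequence mismatches rather than
    note-by-note errors.
    """
    g_mean, g_std = sequence_stats(generated)
    r_mean, r_std = sequence_stats(reference)
    return (g_mean - r_mean) ** 2 + (g_std - r_std) ** 2


if __name__ == "__main__":
    reference = [60, 62, 64, 65, 67]   # illustrative pitches only
    generated = [62, 64, 66, 67, 69]   # same contour, transposed up 2 semitones
    print(seq_level_loss(generated, reference))
```

A loss of this kind is blind to note order, which is why it would normally complement a per-step prediction loss rather than replace it. In a trainable model the same statistics would have to be computed with differentiable tensor operations.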