Automatic songwriting is a topic of significant practical interest. However, research on it is hindered by a shortage of training data, owing to copyright concerns, and by the creative nature of the task. Most notably, prior works often fail to model the cross-modal correlation between melody and lyrics because parallel data is limited, and therefore generate lyrics that are less singable. Existing works also lack effective mechanisms for content control, a much-desired feature for democratizing song creation for people with limited musical background. In this work, we propose to generate lyrics that are pleasant to listen to without training on melody-lyric aligned data. Instead, we design a hierarchical lyric generation framework that disentangles training (based purely on text) from inference (melody-guided text generation). At inference time, we leverage the crucial alignments between melody and lyrics and compile the given melody into constraints that guide the generation process. Evaluation results show that our model generates high-quality lyrics that are more singable, intelligible, coherent, and better rhymed than those of strong baselines, including baselines supervised on parallel data.
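As a rough illustration of what compiling a melody into constraints could look like, the sketch below turns a list of notes into a per-note stress pattern and keeps only candidate lines that fit it. Everything in it is an assumption made for illustration and is not taken from this abstract: the `Note` type, the names `compile_melody`, `satisfies`, and `pick_line`, the rule that one syllable is sung per note, the rule that notes of at least `long_threshold` beats call for a stressed syllable, and the hand-annotated stress patterns of the candidate lines. It also filters finished candidates, which is simpler than steering a decoder step by step.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Note:
    pitch: int       # MIDI pitch number (not used by this sketch)
    duration: float  # note length in beats


def compile_melody(notes: Sequence[Note], long_threshold: float = 1.0) -> List[int]:
    """Return one constraint per note: 1 = needs a stressed syllable, 0 = no requirement.

    Assumption: long notes should carry stressed syllables.
    """
    return [1 if n.duration >= long_threshold else 0 for n in notes]


def satisfies(stresses: Sequence[int], constraints: Sequence[int]) -> bool:
    """Check a line's syllable stresses (1 = stressed) against the constraints.

    Assumption: exactly one syllable is sung per note.
    """
    if len(stresses) != len(constraints):
        return False
    return all(s == 1 for s, c in zip(stresses, constraints) if c == 1)


def pick_line(
    candidates: Sequence[Tuple[str, Sequence[int]]],
    constraints: Sequence[int],
) -> Optional[str]:
    """Return the first candidate line that fits the melody, or None."""
    for text, stresses in candidates:
        if satisfies(stresses, constraints):
            return text
    return None


# Hypothetical four-note phrase: short, long, short, long.
melody = [Note(60, 0.5), Note(62, 1.0), Note(64, 0.5), Note(65, 2.0)]
constraints = compile_melody(melody)

# Hypothetical candidates with hand-annotated stress patterns.
candidates = [
    ("morning light", [1, 0, 1]),           # wrong syllable count
    ("waiting for you", [1, 0, 0, 1]),      # unstressed syllable on a long note
    ("the sun will rise", [0, 1, 0, 1]),    # fits the phrase
]
chosen = pick_line(candidates, constraints)
```

Because the constraints are derived from the melody alone, a text-only generator can supply the candidates, which is the separation of training and inference the abstract describes. A real system would obtain stress patterns from a pronunciation lexicon rather than from hand annotation.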